Finding the right data is hard.
Sound familiar? This was a quote from our 2018 Analytic Data Survey. It showed that only 41% of respondants trusted and understood the data they consumed. At that time, data was directly replicated into our on-prem Hadoop lake from production databases. Then, thousands of long-running undocumented data processes ran on these often misunderstood, constantly changing raw data tables. The raw data tables were tightly coupled to the source systems and had no ownership. The result was an unorganized and incoherent mess of 166,685 tables across 979 databases. Our lake had become a swamp and something had to change. Our solution was to organize the data lake into distinct layers with clear delineations of ownership, business logic, and usage; in other words, a layered architecture.
As an almost 20 year old company in 2019, we needed to make some changes to become the data-driven company we aspired to be. Implementing a layered architecture data model would be the first major step. The concept of layered architecture isn’t a unique concept in the world of data, but how each company implements their architecture can vary. Databricks describes a medallion architechture consisting of three layers. We implemented a four layer model that best suited our data and needs.
The following sections define each layer and their functions in our layered architecture.
The source is where the raw data originates from. It can be a SQL database (like MySQL or MSSQL), a NoSQL database (like DynamoDB), or files stored in a SFTP server. Data within the source is often non-replayable (no versioning is maintained) because it is an operational system that is mainly concerned with the current state.
The Raw layer is the first layer of the data lake and it ingests data, as-is from the source system. It serves as private replayable storage for the data producer. For example, if the data is ingressed on a daily cadence then they can look back at two previous partitions to know the state of the data two days ago. This is extremely helpful with operation tasks (like backfilling and debugging cleansing logic).
The Clean layer stores filtered, schematized, relationalized, standardized, compacted, compressed, and cleaned data for easier integration with other data sources. This layer provides an abstraction from the source system schema so it can evolve without breaking downstream consumers.
The Enterprise layer integrates the data from the Clean layer to build conformed facts and dimensions. It abstracts consumers from the entire notion of source systems and frames everything into an Enterprise context. It contains data from multiple systems updated with governed enterprise business logic (like rules to classify customers) to reduce duplicative processing and ensure consistency in definitions. This serves of the foundational layer for most downstream processing.
The Analytical Layer is the final layer of the data lake and consolidates the conformed data from the Enterprise layer into easy to use cubes. It usually combines several enterprise tables used for business analysis and data science. Data in the Analytical layer are usually in highly denormalized form, contain aggregated data, and may contain some non-governed domain-specific logic.
As we go deeper in the data lake there are less joins, making the ease-of-use higher. However, the trade-off is higher latency and higher risk of outages due to the high number of dependencies. The following diagram illustrates these trade-offs for data consumers as they go deeper into the data lake.
Ultimately, clearly defining the data in each layer helps consumers determine the best place to source data for their use case. For some use cases, reducing latency is of the upmost importance so consumers may opt to consume from Clean and Enterprise layers. This means their jobs will likely have to perform many joins, process lots of rows, and implement complex logic. In other cases, consumers have a high tolerance for delays and opt to use more curated data sets to keep their processes simple, light-weight, and fast.
The Clean and Enterprise layers can cause some confusion because they have some similarities with each other. The following chart provides a comparison between the two layers.
|Aspect||Clean Layer||Enterprise Layer|
|Naming||Applies naming standards to raw names .
|Applies naming standards to conform our Enterprise logical data model.
|Data Types||Applies data type standards to raw definitions and can add a partition column.
|Applies enterprise data types for consistent use.
|Business Logic||Performs corrections in technical debt. For example, fills in missing values, removes duplicates, and breaks apart overloaded columns.||Performs business logic that is approved by Data Governance, has not changed for 30+ days, and is used by 2+ business lines. For example,
|Data Sources||Uses Raw data and Clean lookup tables from the same source system.||Uses Clean data and Enterprise lookups from the same Data Domain (like Subscriptions).|
|Data Model Scope||Uses the Source system.||Uses the Enterprise.|
In the past, data producers had minimal responsibilites for their data. They never really had to think about data before and now they were forced to consider things that previously went unadressed. With a layered architecture in place, we now needed much more from our data producers. There were obvious benefits to this model for data consumers, but what about data producers?
Well data producers saw benefits too, along with the obvious benefits to data consumers. The new layered architecture provided data producers with:
Change is not always easy, especially when dealing with data, but the adoption has been rapid. We have now 108 clean databases owned by 97 source data producers, and we greatly reduced our overall table count from 160,000+ tables to 2,600 (-98.4%) due to eliminating widespread data duplication with the Enterprise Layer.
Now, 79% of our data consumers trust and understand the data; almost double what we saw at the start (41%). Finding the right data doesn’t have to be so hard anymore…
Cover Photo Attribution: Photo generated using DALL-E2 through GoCaaS. GoCaaS is GoDaddy’s own internal centralized platform that provides generative AI for GoDaddy employees and products to consume.