Principles of Data layers in Data Platform

Staging

The staging layer is the first level of data storage on a data platform. It stores data in a format that is closely aligned with the data structure of the operational system. The staging layer is divided into two sublayers: the raw layer and the standardized layer.

Raw

Data from source systems is loaded into the raw layer in the destination, which is a data warehouse or data lake. The destination can be Snowflake, Google BigQuery, Amazon Redshift, Delta Lake, or Amazon S3.
ELT tools such as Fivetran, Stitchdata, Airbyte, and custom connectors can be used to load the data from the source systems into the destination.
The raw layer should be treated with extra care to ensure that no users or processes modify the schema and table structure. The data engineering team should own the raw layer since its state is managed by ELT tools in the modern data stack.
Data is loaded into the raw layer as-is, with additional metadata attributes such as data_loaded_at and file_name.
Each source system’s data should have its own schema in the raw layer for auditing purposes. Schema names should follow the following naming convention: _
The table name in the raw layer should maintain the source system table name.
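The "as-is plus metadata" rule above can be sketched in a few lines of Python. This is a minimal illustration and not the behaviour of any particular ELT tool: the function name add_load_metadata and the sample file name are assumptions, while data_loaded_at and file_name are the attributes named in the text.

```python
from datetime import datetime, timezone


def add_load_metadata(rows, file_name, loaded_at=None):
    """Return copies of the source rows with load metadata attached.

    Source columns are copied unchanged; only data_loaded_at and
    file_name are added, so the raw layer still mirrors the source.
    """
    # Default to the current UTC time when no load timestamp is supplied.
    loaded_at = loaded_at or datetime.now(timezone.utc)
    return [
        {**row, "data_loaded_at": loaded_at.isoformat(), "file_name": file_name}
        for row in rows
    ]


# Hypothetical source rows, kept as strings exactly as they arrived.
source_rows = [{"order_id": "1001", "amount": "19.90"}]
raw_rows = add_load_metadata(source_rows, "orders_2024_01_01.csv")
```

Because the function returns new dictionaries, the source rows themselves are never modified, which matches the rule that nothing should alter the structure of raw data.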

Analogy

The first step in preparing food in a restaurant is collecting the ingredients from suppliers, which is similar to collecting data from operational systems. This is where data contracts play a critical role. They ensure that:

1) Data arrives on the platform as expected. Any availability or accessibility issue in the operational system delays the delivery of data to the business (the consumers).

2) The data producer and the platform team (the supplier and the restaurant) agree on an SLA for how the data (the ingredients) is delivered.

3) Changes to the structure of the data (changes to the ingredients) are communicated to the platform team (the restaurant). Otherwise, the restaurant cannot prepare the food on the menu.
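The three points above can be expressed as a simple automated check. The sketch below is only one possible shape for a data contract. The DataContract class, its fields, and the check_contract function are all hypothetical names introduced for illustration, since the article does not prescribe a contract format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataContract:
    # All fields are illustrative assumptions, not a standard format.
    dataset: str
    expected_columns: frozenset
    max_delay: timedelta  # agreed SLA for data arrival


def check_contract(contract, observed_columns, last_arrival, now):
    """Return a list of contract violations (empty when the contract holds)."""
    violations = []
    # Points 1 and 2: the data must arrive within the agreed SLA.
    if now - last_arrival > contract.max_delay:
        violations.append(f"{contract.dataset}: data is late (SLA {contract.max_delay})")
    # Point 3: structural changes must not arrive unannounced.
    observed = set(observed_columns)
    missing = contract.expected_columns - observed
    added = observed - contract.expected_columns
    if missing:
        violations.append(f"{contract.dataset}: missing columns {sorted(missing)}")
    if added:
        violations.append(f"{contract.dataset}: unannounced columns {sorted(added)}")
    return violations
```

A check like this could run after each load, so that the platform team hears about a late delivery or a changed structure before the consumers do.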

Standardised

The standardised layer is similar to the raw layer, with the following additions:

Right data type: ELT tools do not always store columns in the right data type when they load data. We have had challenges with a few ELT tools that loaded the data as VARCHAR in the destination. The standardised layer is where such columns are cast to their correct types.
Metadata first: Tables and attributes in your destination should have descriptions so that analysts can easily discover and access the data. Metadata should be centrally managed in your warehouse, so that a data catalog tool can source it directly from your destination.
Metadata can also be used to classify your data, for example into PII categories. Definitions for the data should be written by the data owner in the business, working with the data governance team.
Manage history: In some cases you cannot maintain the history of changes from your sources in the raw layer, either because there is no way to identify the delta or because of limitations in your data pipeline. It is recommended to capture the history of changes as early as possible in the data layers.
Subset of data (optional): Data only needs to be brought into the standardised layer if analysts need it to build ad-hoc analyses.
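The "right data type" and "manage history" points can be sketched together in Python. Everything here is an assumption made for illustration: the ORDER_TYPES mapping, the function names, and the choice of a hash comparison to detect changed rows. In practice, this logic would usually live in SQL or in a transformation tool inside the warehouse.

```python
import hashlib
import json
from datetime import date
from decimal import Decimal

# Hypothetical column-to-type mapping for one raw table.
ORDER_TYPES = {"order_id": int, "amount": Decimal, "order_date": date.fromisoformat}


def standardise_row(raw_row, column_types):
    """Cast string values loaded as VARCHAR to their intended types.

    Columns without a mapping pass through unchanged; empty strings become None.
    """
    out = {}
    for column, value in raw_row.items():
        caster = column_types.get(column)
        if value is None or value == "":
            out[column] = None
        elif caster is None:
            out[column] = value
        else:
            out[column] = caster(value)
    return out


def row_fingerprint(row, ignore=("data_loaded_at", "file_name")):
    """Hash the business columns so that a changed row can be detected."""
    payload = {k: str(v) for k, v in sorted(row.items()) if k not in ignore}
    return hashlib.sha256(json.dumps(payload).encode("utf-8")).hexdigest()


def append_if_changed(history, key, row):
    """Append a new version to history[key] only if it differs from the latest one."""
    versions = history.setdefault(key, [])
    if not versions or row_fingerprint(versions[-1]) != row_fingerprint(row):
        versions.append(row)
        return True
    return False
```

Ignoring the load metadata in the fingerprint means that reloading an unchanged row does not create a new version, while any change to a business column does.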

How can we make the data platform a success?

Continual sessions with analysts and the business are required to help them understand the data layers and their rationale.
Monitoring usage history allows the data platform team to measure how frequently data is accessed, which datasets are used most, and which queries run longest, so that it can take proactive steps to improve the performance of the platform.
Agility enables the business to trust in the foundation of the data platform, which should be sustainable, scalable, and reliable.
Continual feedback and iteration help the data platform team identify the challenges faced by the data analyst team and, if required, run sessions to increase usage of the data platform.
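As a rough illustration of the usage monitoring point above, the sketch below summarises a query log. The log format (a list of dictionaries with dataset and duration_seconds keys), the function name, and the 60-second threshold are all assumptions. Each warehouse exposes its query history in its own way.

```python
from collections import Counter


def summarise_usage(query_log, slow_seconds=60.0, top_n=3):
    """Summarise a query log into most-used datasets and long-running queries."""
    # Count how often each dataset appears in the log.
    access_counts = Counter(entry["dataset"] for entry in query_log)
    # Keep the queries that ran for at least the chosen threshold.
    long_running = [e for e in query_log if e["duration_seconds"] >= slow_seconds]
    return {
        "most_used": access_counts.most_common(top_n),
        "long_running": long_running,
    }
```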
