dbt Data Profiling & Staying on Top of Tech Stacks

JMAN & dbt

At JMAN, we are proud to call ourselves ‘nerds’. In particular, we are proud to be nerdy about dbt, a tech-stack that we have had over three years of experience in delivering to our clients. In this time, it has revolutionised data transformation and we have very much grown alongside it.

A question we have also been asked a lot is: what is actually going on in my data? The answer here often helps answer the next question: is it worth cleaning out the cobwebs from the darkest corners of our Data Warehouse?

Whether it’s part of the standard governance process or a due diligence, getting a grip on the answers to these two questions is vital to building trust in what you have as well as avoiding future errors.

Data Quality

So what exactly are we talking about here or, in other words, what exactly do we want to assess:

Size – what is the size of our dataset?
Completeness – are we missing any data?
Relevance – how many unique values are in our data and are these useful?
Coverage – what range of values are covered?

Data Profiler

This problem has been tackled before in dbt (dbt_profiler). However, we found limitations with memory and maintaining the history of the profile result. In classic JMAN fashion (ie wanting to stay at the cutting edge of advances in dbt), we therefore decided to build and contribute our own Data Profiling script to the dbt community.

Accessible via the below link, the package is currently supported with both Snowflake and Postgres and, for each column in your database, produces detail on:

Row Counts
Null & non-null values
Distinct & unique values
Max, min & avg values

Importantly, the package supports schema-level profiling with exclude table options to ensure you are only getting the outputs you need.

Future Enhancements

This is also not the end of the journey. We expect further improvements from the community as well as having additional metrics, database level profiling and support for Microsoft SQL Server, AWS Redshift & Big Query in our own pipeline.

Access

In summary, this is a big step for both JMAN and our clients in terms of automating what should be a critical part of most data pipelines. Moreover, we hope to continue contributing to and help growing a thriving community in one of the industry’s most exciting tech-stacks.

For more information, please see the below link to the JMAN Group dbt repository:

https://hub.getdbt.com/jmangroup/data_profiler/latest/

Otherwise, feel free to reach out on Slack if you want to chat about dbt, any of our other work or, simply, to just have a chat – we’re here.

Get Sharing