One of the biggest pain points in developing ETL pipelines – chains of jobs that move, clean, merge, and aggregate analytical data – has been regression testing: verifying how a change in source code (mostly SQL) affects the produced data.
Early in my career, as an on-call data engineer at Lyft, I accidentally introduced a breaking code change while attempting to ship a hotfix at 4 AM to a SQL job that computed tables for core business analytics. A seemingly small change in filtering logic ended up corrupting data for all downstream pipelines and breaking dashboards for the entire company. Apart from being a silly mistake, this incident highlighted the lack of proper tooling for testing changes. If there had been a way to quickly compare the data computed by the production code vs. the hotfix branch, I would have immediately spotted the alarming divergence and avoided merging the breaking change.
Without a diffing tool, the typical options for regression testing are: (1) Data “unit tests” (e.g. check primary key uniqueness, ensure values fall within an interval, etc.) – these are helpful, but a costly investment. Frameworks such as dbt make them easier to write, but it’s often still prohibitively hard to verify all assumptions about a large table. (2) Write custom SQL queries to compare the data produced by the prod and dev versions of the source code (e.g. compare counts, match primary keys). This can easily take 100+ lines of SQL and hours of unsatisfying work, which no one really wants to do. (3) "Fuck It, Ship It" is always an option, but too risky nowadays, as analytical data powers not only dashboards but also production ML models.
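To make option (2) concrete, here is a minimal sketch of what such a hand-rolled comparison looks like, run against an in-memory SQLite database. The table and column names (`users_prod`, `users_dev`, `id`, `plan`) are hypothetical; a real version balloons quickly once you compare every column, handle NULLs, and format the results:

```python
import sqlite3

# Hypothetical prod/dev copies of the same table, keyed by `id`.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_prod (id INTEGER PRIMARY KEY, plan TEXT);
    CREATE TABLE users_dev  (id INTEGER PRIMARY KEY, plan TEXT);
    INSERT INTO users_prod VALUES (1, 'free'), (2, 'pro'), (3, 'pro');
    INSERT INTO users_dev  VALUES (1, 'free'), (2, 'free');  -- row 3 lost, row 2 changed
""")

# Compare row counts, primary-key overlap, and per-column values.
diff_sql = """
SELECT
    (SELECT COUNT(*) FROM users_prod)                      AS prod_rows,
    (SELECT COUNT(*) FROM users_dev)                       AS dev_rows,
    (SELECT COUNT(*) FROM users_prod p
      LEFT JOIN users_dev d USING (id)
      WHERE d.id IS NULL)                                  AS pks_missing_in_dev,
    (SELECT COUNT(*) FROM users_prod p
      JOIN users_dev d USING (id)
      WHERE p.plan <> d.plan)                              AS differing_values
"""
result = conn.execute(diff_sql).fetchone()
print(result)  # (3, 2, 1, 1)
```

And this only covers one column of one table; repeating it per column and per table is where the "100+ lines of SQL" come from.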
As this problem is common in data engineering, some large organizations have built and open-sourced their own solutions – for example, BigDiffy by Spotify. However, most of these tools are CLI-based and produce results in a plain-text format that is hard to comprehend when you are dealing with complex data.
To fit our users' existing workflows, we’ve built a web interface with interactive charts showing both diff summary statistics (e.g. % of differing values by column) and value-level side-by-side comparison (git diff style). But since the mission of the tool is to save engineers as much time as possible, we also opened an API for automation through Airflow or other orchestrators, and built a GitHub workflow that runs a diff on every pull request that changes ETL code. Since billion-row-scale datasets are not uncommon nowadays, there is an optional sampling feature that helps keep compute costs low and deliver results within a few minutes no matter how large the dataset is.
We've found Datafold to be a good fit for the following workflows: (1) Developing data transformations – before an ETL job is shipped to production, it undergoes multiple iterations. It’s often important to see how the data changes between iterations, and a diff is particularly useful when you have 1M+ rows and 100+ columns, where “SELECT *” becomes useless. (2) Code review & testing: large organizations have hundreds of people committing to ETL codebases, and understanding the impact of even a modest SQL diff is daunting. Datafold can produce a data diff for every commit in minutes, so changes are well understood before they ship. (3) Data transfer validation: moving large volumes of data between databases is error-prone, especially if done via change data capture (CDC): a single lost event can affect the resulting dataset in a way that is tricky to debug. We allow comparing datasets across different databases, e.g. PostgreSQL & Snowflake.
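As an illustration of workflow (3), one standard way to validate a transfer (a generic sketch under assumed data, not Datafold's actual implementation) is to fingerprint each row on both sides and compare the fingerprint sets, which surfaces rows that were lost or altered in flight:

```python
import hashlib

def row_fingerprints(rows):
    """Map each row's primary key to a hash of its normalized values.
    Comparing two such maps reveals missing and altered rows."""
    return {
        row["id"]: hashlib.sha256(
            "|".join(str(row[col]) for col in sorted(row)).encode()
        ).hexdigest()
        for row in rows
    }

# Hypothetical rows as fetched from a source (e.g. PostgreSQL) and a
# target (e.g. Snowflake), with a single CDC event lost for id=3.
source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 3, "amount": 7}]
target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]

src, dst = row_fingerprints(source), row_fingerprints(target)
missing = src.keys() - dst.keys()
altered = {k for k in src.keys() & dst.keys() if src[k] != dst[k]}
print(missing, altered)  # {3} set()
```

In practice the fingerprints would be computed inside each database rather than in application code, but the comparison logic is the same.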
We are passionate about improving tooling for data engineers and would love to hear about your experience with developing data pipelines and ensuring data quality. Also, if you think that dataset diffing can be helpful in other domains, we are very curious to learn from you!