Data teams are often the last to know about data-related issues. They commonly find out only when an executive messages them about a broken dashboard. This is comparable to finding out about your servers being down only when your end users report it! In software engineering, this problem is solved with observability tools like Datadog and SignalFx. These monitor your system over time by tracking metrics (like CPU, memory usage or any arbitrary value), and sending alerts when they hit thresholds or are anomalous.
Metaplane solves this problem for data teams. We continuously monitor our users’ data warehouse tables and columns, testing for things like row counts, freshness, cardinality, uniqueness, nullness, and statistical properties like mean/median/min/max, as well as schema changes. After we build up a baseline of data points for each of these tests, we send alerts on anomalies to the user's Slack channel. Each alert includes metadata like upstream/downstream tables and BI dashboards affected by the issue, so that the user can assess how important the issue is and how quickly it should be addressed.
We're particularly careful about alert fatigue and false positives. Since we can't ask users to set manual thresholds (they would be changing all the time), we have to make a reasonable prediction based on past data, which can result in false positives and false negatives. If we under-alert, we miss important issues, but if we over-alert, users become desensitized and start ignoring alerts. Our solution is to include "Mark as anomaly" and "Mark as normal" buttons with each alert, for users to provide feedback to the model.
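To make "a reasonable prediction based on past data" concrete, here is a minimal Python sketch of one common approach: flag a new value when it falls far outside the median of recent history, measured in units of median absolute deviation (MAD). This is an illustrative stand-in, not a description of Metaplane's actual model; the function name `is_anomalous`, the multiplier `k`, and the minimum history length are all assumptions made for the example.

```python
from statistics import median


def is_anomalous(history, value, k=5.0, min_points=5):
    """Return True if `value` lies more than k MADs from the median of `history`.

    `history` is a list of past observations for one test (e.g. daily row counts).
    `k` and `min_points` are illustrative defaults, not tuned values.
    """
    if len(history) < min_points:
        return False  # not enough baseline yet, so stay quiet rather than guess

    center = median(history)
    mad = median(abs(x - center) for x in history)

    # A perfectly flat history has MAD == 0; fall back to 1% of the center
    # so that tiny fluctuations do not trigger an alert.
    spread = mad if mad > 0 else max(abs(center) * 0.01, 1e-9)

    return abs(value - center) > k * spread


history = [100, 102, 98, 101, 99, 100, 103]
print(is_anomalous(history, 10_000))  # a large spike
print(is_anomalous(history, 101))     # an ordinary day
```

In a setup like this, a "Mark as normal" click could be modeled by appending the flagged value to `history`, which widens the accepted range over time, while "Mark as anomaly" would keep it out of the baseline.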
To give a common example, Metaplane can tell you that a revenue metric in a Snowflake column has unexpectedly spiked from $100 to $10,000. The alert includes the upstream dbt dependencies and the downstream Looker dashboards that are impacted. Another example is a Redshift table that is usually updated every day but hasn’t been updated in over 48 hours. A third is a BigQuery table that typically grows by 10M rows every day but suddenly adds only 1M because of an upstream vendor bug. These are all what we think of as “silent data bugs”: all systems are green, but your data is just wrong!
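The freshness and row-count examples above can be sketched as two small Python checks. The thresholds here (48 hours, half of typical growth) are hardcoded only to mirror the examples; in practice they would be derived from each table's history rather than set by hand. The function names and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, now, max_age=timedelta(hours=48)):
    """Return True if the table has gone longer than `max_age` without an update."""
    return now - last_updated > max_age


def row_growth_dropped(past_daily_increments, latest_increment, min_ratio=0.5):
    """Return True if the latest daily increment is below `min_ratio` of the typical one."""
    typical = sum(past_daily_increments) / len(past_daily_increments)
    return latest_increment < min_ratio * typical


now = datetime(2022, 1, 3, 1, 0, tzinfo=timezone.utc)
last_updated = datetime(2022, 1, 1, 0, 0, tzinfo=timezone.utc)  # 49 hours earlier
print(is_stale(last_updated, now))

# A table that usually adds 10M rows per day adds only 1M today.
print(row_growth_dropped([10_000_000] * 7, 1_000_000))
```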
Over the last eight months, we've caught problems like these for data teams at dozens of companies including Imperfect Foods, Drift, Vendr, Reforge, Air Up, Teachable, and Appcues.
Today, we’re excited to launch our self-serve product and free plan with the HN community. Setting up monitoring for your data stack takes less than 10 minutes. Here's a 4-minute demo video showing how it works: https://www.loom.com/share/1aa54eb8b45548e180f6ab3a4a580cc5. We make money by charging for more tests and team/enterprise features. You can use our new free plan or try out all of our features in a 30-day trial, no credit card required.
Our goal is to help data teams of any size be the first to know about data issues. We think observability will become as much of a no-brainer for data teams as it is for software engineers today. Starting on AWS? Get Datadog. Bringing on Snowflake? Get a data observability tool (hopefully ours!). Eventually we want to support more of the use cases that you’d expect from a Datadog for data, like log centralization and diagnostics, spend monitoring, performance insights, and deep integration with upstream applications. For now, we’re just starting where the pain is highest.
We'd love to hear your ideas, experiences, and feedback, and will be answering any questions in the comments!