We worked together for the last 4 years at a startup where we built and managed data products for insurers and banks. A common pattern we saw was teams taking data from their internal tools (CRM, HR system, etc.), application databases, and third-party sources and storing it in a warehouse for analysis. However, when analysts or data scientists used the data for reports, they would spot something suspicious and the engineering team would have to manually go through the data pipelines to find the source of the problem. More often than not it was something simple, like a spike in missing values because an ETL job failed or stale data because a third-party data source hadn’t updated correctly. We realised that the reliability and trustworthiness of the raw data was essential before you could move on to more interesting tasks like analysis, insight or predictions.
We wanted to do this without having to write and maintain lots of individual tests in our code. So we built Hubble, which connects to a data warehouse and creates tests based on the type of data being stored (e.g. freshness of timestamps, cardinality of strings, max value of numbers, missing values). We’ve also added the ability to write custom tests using a built-in SQL editor. All the tests run on a schedule and you’ll get an email or Slack alert when they fail. We’re also building webhooks and an Airflow operator so you can run tests immediately after running an ETL job or trigger a process to fix a failing test.
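To make the idea of type-driven tests concrete, here is a rough sketch of how checks could be generated from column types. This is an illustration only and not Hubble's actual implementation: it uses an in-memory SQLite database as a stand-in for a warehouse, and the names `CHECKS_BY_TYPE`, `generate_tests`, `run_tests` and the `trips` table are all hypothetical.

```python
import sqlite3

# Hypothetical mapping from a column's declared type to the checks worth running.
# Each check is a SQL aggregate that returns a single number per column.
MISSING = "AVG(CASE WHEN {col} IS NULL THEN 1.0 ELSE 0.0 END)"
CHECKS_BY_TYPE = {
    "TEXT": {"missing_ratio": MISSING, "cardinality": "COUNT(DISTINCT {col})"},
    "INTEGER": {"missing_ratio": MISSING, "max_value": "MAX({col})"},
    "TIMESTAMP": {
        "missing_ratio": MISSING,
        # Freshness: hours elapsed since the most recent timestamp (SQLite syntax).
        "hours_since_latest": "(julianday('now') - julianday(MAX({col}))) * 24",
    },
}


def generate_tests(conn, table):
    """Build one (name, sql) pair per applicable check, based on column types."""
    tests = []
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    for _cid, col, col_type, *_rest in conn.execute(f"PRAGMA table_info({table})"):
        for check, expr in CHECKS_BY_TYPE.get(col_type.upper(), {}).items():
            sql = f"SELECT {expr.format(col=col)} FROM {table}"
            tests.append((f"{table}.{col}.{check}", sql))
    return tests


def run_tests(conn, tests):
    """Run each query inside the database; only the single-number results come back."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in tests}


# Assumed demo data, loosely modelled on bike-share trips.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, duration INTEGER, started_at TIMESTAMP)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("SF", 12, "2020-03-01 10:00:00"),
        ("SF", 45, "2020-03-02 11:00:00"),
        (None, 30, "2020-03-03 12:00:00"),
        ("Oakland", None, "2020-03-04 13:00:00"),
    ],
)

results = run_tests(conn, generate_tests(conn, "trips"))
for name, value in results.items():
    print(name, value)
```

In a setup like this, each value would be compared against a threshold (for example, alert if `missing_ratio` jumps or `hours_since_latest` exceeds the expected refresh interval) and stored so that changes can be tracked over time.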
Instead of asking users to send their data to us, we run the tests in the data warehouse and track the test results over time. Today we support BigQuery, Snowflake and Rockset (which lets us work with MongoDB and DynamoDB) and are adding more on request.
We’re planning on charging $200 a month for a few seats, and $30-50 for extra users after that.
We’re still at an early access stage but want the HN community’s feedback, so we’ve opened up access to the app for a few days. You can try it out here: https://gethubble.io/hn. We’ve added a demo data warehouse you can start with that has data on COVID-19 cases in Italy and bike-share trips in San Francisco. Thanks, and we’re looking forward to hearing your ideas, experiences and feedback!