Dataform (YC W18) – Build Reliable SQL Data Pipelines as a Team

Hi HN!

We’re Guillaume and Lewis, founders of Dataform, and we're excited (and nervous) to be posting this on HN.

Dataform is a platform for data analysts to manage data workflows in cloud data warehouses such as Google BigQuery, Amazon Redshift or Snowflake. With our open source framework and our web app, analysts can develop and schedule reliable pipelines that turn raw data into the datasets they need for analytics.

Before starting Dataform, we managed engineering teams in AdSense and led product analytics for publisher ads. We relied heavily on data (and data pipelines!) to generate insights, drive better decisions and build better products. Companies like Google invest a lot in internal tools that let analysts manage data and build data pipelines. With those tools, I could define a new dataset in SQL in 5 minutes, have it updated every day and then use it in my reports.

Most businesses today are centralising their raw data into cloud data warehouses but lack the tools to manage it efficiently. Pipelines are run manually or via custom scripts that break often. Alternatively, the company invests engineering resources to set up, maintain and debug a framework like Airflow, but that only covers scheduling, and the technical bar is often too high for analysts to contribute.

We saw a need for a self-service solution for data teams to manage data efficiently, so that analysts can own the entire workflow from raw data to analytics. We built Dataform with two core principles in mind:

1. Bring engineering best practices to data management. In Dataform, you build data pipelines in SQL, and our open source framework lets you define dependencies, build incremental tables and reuse code across scripts. You can write tests against your raw and transformed data to ensure data quality across your analytics. Lastly, our development environment encourages best practices: analysts can work with version control, code review and sandboxed environments.

2. Let data teams focus on data, not infrastructure. We want to bring a better, faster and cheaper alternative to what businesses have to build and maintain in-house today. Our web app comes with a collaborative SQL editor, where teams develop and push their changes to GitHub. You can then orchestrate your data pipelines without having to maintain any infrastructure.
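
To make the dependency idea in point 1 a bit more concrete, here is a rough sketch in Python. It is not Dataform's implementation or API, only an illustration of the general technique: if each dataset declares which datasets it reads from, a topological sort gives a safe execution order. The dataset names and the execution_order helper are made up for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical datasets: each name maps to the datasets its SQL reads from.
DEPENDENCIES = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "customer_revenue": {"clean_orders", "raw_customers"},
}


def execution_order(dependencies):
    """Return dataset names so that each one comes after everything it depends on.

    Raises graphlib.CycleError if the declared dependencies form a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
```

With an order like this, a scheduler can run each dataset's SQL only once its inputs are fresh, and a circular dependency is reported up front instead of failing halfway through a run.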

Here's a short video demo where we develop two new datasets, push the code to GitHub and schedule their execution, all in under 5 minutes.

https://www.youtube.com/watch?v=axDKf0_FhYU

You can sign up at https://dataform.co. If you're curious how it works, here are the docs: https://docs.dataform.co and the link to our open source framework: https://github.com/dataform-co/dataform

We would love to hear your feedback and answer any questions you might have!

Lewis and Guillaume


