I stumbled into this idea as a data engineer on Disney+’s subscriptions team. We were “firefighters for data,” ready to debug huge pipelines that constantly crashed and burned. The worst part of my job at Disney+ was the graveyard on-call rotation, where pages from 12am to 5am were guaranteed and you'd have to dig through thousands of lines of someone else’s SQL. SQL is long-winded: 1,000 lines of SQL can often be summarized by 10 key transforms. We take that SQL and express those transforms as reusable, testable, scalable Spark objects.
Serra is written in PySpark and modularizes every component of ETL through Spark objects. Similar to dbt, we apply software engineering best practices to data, but we aim to do it not just for transformations but for data connectors as well. We do this with a YAML configuration file: if a pipeline consists of that 1,000-line SQL script plus some third-party connectors, we can summarize all of it in a roughly 12-block config file (10 blocks for the transforms, 2 for the connectors, now handled in-house) that gives you a high-level overview and easier debugging. We can then attach tests and custom alerts to each of these objects/blocks, so we know exactly where the pipeline breaks and why.
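To make the block idea concrete, here is a rough sketch of what a few blocks of that kind of config could look like. The block names and fields below are purely illustrative and are not Serra's actual schema; see the repo docs for the real format.

  # Hypothetical Serra-style pipeline config (illustrative only)
  read_subscriptions:            # connector block: read the source table
    connector: snowflake
    table: raw.subscriptions
  map_plan_names:                # transform block: one of the ~10 key transforms
    transform: map
    input: read_subscriptions
    column: plan_code
    mapping:
      D1: Disney+ Basic
      D2: Disney+ Premium
    tests:
      - not_null: plan_code      # per-block test, so a failure points right here
    alert: data-oncall           # hypothetical custom alert target
  write_summary:                 # connector block: write the result
    connector: databricks
    input: map_plan_names
    table: analytics.subscription_summary

Because each block maps to one Spark object, a failing test or alert tells you which transform broke instead of pointing you at the middle of a 1,000-line script.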
We are open source to make it easy to customize Serra to whatever flavor you like with custom transformers/connectors. The connectors we support out of the box are Snowflake, AWS, BigQuery, and Databricks, and we're adding more based on feedback. The transforms we support include mapping, pivoting, joining, truncating, imputing, and more. We’re doing our best to make Serra as easy to use as possible: if you have Docker installed, you can run the Docker command below to instantly get set up with a Serra environment for creating modular pipelines.
We wrap up our functionality with a command line tool that lets you create your ETL pipelines, test them locally on a subset of your data, and deploy them to the cloud (currently we only support Databricks, but we'll soon support others and plan to host our own clusters too). It also has an experimental “translate” feature that is still a bit finicky: the idea is to take your existing SQL script and get suggestions on how to chunk up and modularize the job with our config. It’s an early, not-yet-fleshed-out suggestion feature, but we think it’s a cool approach.
Here’s a quick demo walking through retooling a long-winded SQL script into an easily maintainable, scalable ETL job: https://www.loom.com/share/acc633c0ec03455e9e8837f5c3db3165?.... (Docker command: docker run --mount type=bind,source="$(pwd)",target=/app -it serraio/serra /bin/bash)
We don’t see or store any of your data; we’re a transit layer that helps you write ETL jobs you can send to your warehouse of choice with your actual data. Right now we’re helping customers retool their messy data pipelines, and we plan to monetize by hosting Serra in the cloud: charging when you run jobs on our own clusters, and per API call for the translate feature (once it’s mature).
We’re super excited to launch this on Hacker News. We’d love to hear what you think. Thanks in advance!