BuildFlow (YC W23) – The FastAPI of data pipelines

Read Post

Hey HN! We’re Caleb and Josh, the founders of BuildFlow (https://www.buildflow.dev). We provide an open source framework for building your entire data pipeline quickly using Python. You can think of us as an easy alternative to Apache Beam or Google Cloud Dataflow.

The problem we're trying to solve is simple: building data pipelines can be a real pain. You often need to deal with complex frameworks, manage external cloud resources, and wire everything together into a single deployment (you’re probably drowning in Yaml by this point in the dev cycle). This can be a burden on both data scientists and engineering teams.

Data pipelines is a broad term, but we generally mean any kind of processing that happens outside of the user facing path. This can be things like: processing file uploads, syncing data to a data warehouse, or ingesting data from IoT devices.

BuildFlow, our open-source framework, lets you build a data pipeline by simply attaching a decorator to a Python function. All you need to do is describe where your input is coming from and where your output should be written, and BuildFlow handles the rest. No configuration outside of the code is required. See our docs for some examples: https://www.buildflow.dev/docs/intro.

When you attach the decorator to your function, the BuildFlow runtime creates your referenced cloud resources, spins up replicas of your processor, and wires up everything needed to efficiently scale out the reads from your source and then writes to your sink. This lets you focus on writing logic as opposed to interacting with your external dependencies.

BuildFlow aims to hide as much complexity as possible in the sources / sinks so that your processing logic can remain simple. The framework provides generic I/O connectors for popular cloud services and storage systems, in addition to "use case driven” I/O connectors that chain together multiple I/O steps required by common use cases. An example “use case driven” source that chains together GCS pubsub notifications & fetching GCS blobs can be seen here: https://www.buildflow.dev/docs/io-connectors/gcs_notificatio...

BuildFlow was inspired by our time at Verily (Google Life Sciences) where we designed an internal platform to help data scientists build and deploy ML infra / data pipelines using Apache Beam. Using a complex framework was a burden on our data science team because they had to learn a whole new paradigm to write their Python code in, and our engineering team was left with the operational load of helping folks learn Apache Beam while also managing / deploying production pipelines. From this pain, BuildFlow was born.

Our design is based around two observations we made from that experience:

(1) The hardest thing to get right is I/O. Efficiently fanning out I/O to workers, concurrently reading / processing input data, catching schema mismatches before runtime, and configuring cloud resources is where most of the pain is. BuildFlow attempts to abstract away all of these bits.

(2) Most use cases are large scale but not (overly) complex. Existing frameworks give you scalability and a complicated programming model that supports every use case under the sun. BuildFlow provides the same scalability but focuses on common use cases so that the API can remain lightweight & easy to use.

BuildFlow is open source, but we offer a managed cloud offering that allows you to easily deploy your pipelines to the cloud. We provide a CLI that deploys your pipeline to a managed kubernetes cluster, and you can optionally opt in to letting us manage your resources / terraform as well. Ultimately this will feed into our VS Code Extension which will allow users to visually build their data pipelines directly from VS Code (see https://launchflow.com for a preview). The extension will be free to use and will come packaged with a bunch of nice-to-haves (code generation, fuzzing, tracing, and arcade games (yep!) just to name a few in the works).

Our managed offering is still in private beta but we’re hoping to release our CLI in the next couple weeks. Pricing for this service is still being ironed out but we expect it to be based on usage.

We’d love for you to try BuildFlow and would love any feedback. You can get started right away by installing the python package: pip install buildflow. Check out our docs (https://buildflow.dev/docs/intro) and GitHub (https://github.com/launchflow/buildflow) to see examples on how to use the API.

This project is very new, so we’d love to gather some specific feedback from you, the community. How do you feel about a framework managing your cloud resources? We’re considering adding a module that would let BuildFlow create / manage your terraform for you (terraform state would be dumped to disk). What are some common I/O operations you find yourself rewriting? What are some operational tasks that require you to leave your code editor? We’d like to bring as many tasks into BuildFlow and our VSCode extension so you can avoid context switches.

BuildFlow (YC W23) – The FastAPI of data pipelines

Get Top 5 Posts of the Week