Artie (YC S23) – Real-time data replication to data warehouses

Hey, I’m Robin and I’m the founder/CTO at Artie (https://www.artie.so/). We solve the problem of stale production data in the data warehouse. We’re open source, and you can try it for free here: https://github.com/artie-labs/transfer.

Specifically, we do real-time data replication from databases to data warehouses. We leverage change data capture (CDC) and stream processing to perform data transfers efficiently, enabling sub-minute latency and significant cost savings. Here’s a quick demo (jumping straight to the important part): https://www.youtube.com/watch?v=uAi1tm4gd9U#t=81s.

I encountered this problem when I was a heavy data warehouse user at prior jobs. The data in our data warehouse was super lagged and analytics were always stale. Imagine a fintech company performing anti-fraud/transaction monitoring with batched ETLs and finding out that fraudulent transactions occurred 24 hours ago. This was very frustrating to me! Since my background was in distributed systems and database architecture, I knew that there was a better way to perform data transfers.

The more teams I spoke with, the more I realized this was a real pain point. People wanted real-time data for analytics, fraud alerting, transaction monitoring, and training AI/ML models, but there wasn’t an easy out-of-the-box solution. Companies were constrained on either latency, schema integrity/data usability, or data accuracy. Companies told me that if I built a tool that was robust and near real time, while also maintaining schema integrity and data accuracy, they would very happily pay for it.

So I built Artie, a real-time, open-source data streaming solution that transfers data from databases to data warehouses and handles row changes (DMLs) and schema changes (DDLs) automatically in-flight.

Typical ETL solutions leverage batched processes that run on a schedule (DAGs, Airflow), which cannot achieve real-time data syncs. This means that when companies aggregate production data into their data warehouse, the underlying data is always stale.
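
To make that concrete, here is a rough sketch of the batch pattern (not our code; the table, DAG, and job names are made up, and it assumes the Airflow 2.x API). The warehouse copy only refreshes when the schedule fires, so everything in between is stale:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def copy_orders_to_warehouse():
        # e.g. dump the orders table (or rows changed since the last run) and
        # bulk-load the extract into the warehouse; anything that changes
        # between runs stays stale until the next scheduled run
        pass


    with DAG(
        dag_id="nightly_orders_sync",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",  # data is up to ~24 hours old by the next run
        catchup=False,
    ) as dag:
        PythonOperator(task_id="copy_orders", python_callable=copy_orders_to_warehouse)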

Streaming change data capture (CDC) logs is a more efficient way to transfer data, and helps lower networking/data ingestion costs as well. However, building data pipelines with CDC streaming is complicated. I wanted Artie to be the tool that abstracts away that complexity so that any company can benefit from having real-time data.

A big challenge is implementing CDC streaming for stateful data (i.e. row updates/deletes) and schema changes (DDLs), which most streaming solutions just don’t address, meaning that complexity is passed down to the customer. A lot of in-house streaming solutions leverage some combination of Debezium + Kafka/Kinesis + Apache Flink and are able to achieve near real-time syncs, but they only handle append-only events (inserts); they don’t apply updates/deletes (DMLs) or handle schema changes/schema evolution (DDLs). Not handling these means the data at the destination doesn’t look exactly like the production database, which obscures the source of truth. You end up having to do additional work to make the data warehouse tables match the source DB.
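
To illustrate the difference, here is a toy example (not Artie’s code; the table/column names and the SQL flavor are just for illustration) of applying a Debezium-style change event as a keyed merge or delete, instead of blindly appending it:

    def apply_change_event(event: dict, table: str, pk: str = "id") -> str:
        """Turn a simplified Debezium-style payload into a single warehouse statement."""
        op = event["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
        if op == "d":
            return f"DELETE FROM {table} WHERE {pk} = {event['before'][pk]!r};"

        row = event["after"]
        cols = ", ".join(row)
        src_cols = ", ".join(f"{value!r} AS {col}" for col, value in row.items())
        updates = ", ".join(f"{col} = src.{col}" for col in row if col != pk)
        insert_vals = ", ".join(f"src.{col}" for col in row)
        # An append-only pipeline would just INSERT here, leaving stale copies of
        # the row behind; a MERGE keyed on the primary key keeps one current row per key.
        return (
            f"MERGE INTO {table} AS tgt USING (SELECT {src_cols}) AS src "
            f"ON tgt.{pk} = src.{pk} "
            f"WHEN MATCHED THEN UPDATE SET {updates} "
            f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({insert_vals});"
        )


    # e.g. an update event for orders.id = 42:
    print(apply_change_event({"op": "u", "after": {"id": 42, "status": "refunded"}}, table="orders"))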

So how do we offer a robust CDC streaming solution? We grab CDC logs using Debezium and/or our custom connector (which solves for certain edge cases that Debezium doesn’t handle) and push them into Kafka (or Google Pub/Sub). Kafka helps ensure ordering and ease of recovery upon an outage: each table gets its own topic, and the partition key is the table’s primary key(s), so writes for a given row are never processed out of order. Artie then consumes these events from Kafka into an in-memory DB, where our typing library infers schemas (DML and DDL), performs optimizations like deduplication, and then flushes the data to the data warehouse. Once the data warehouse confirms a successful merge, we commit the offset in Kafka. All of this happens with sub-minute data latency, even at high volumes (several TBs or billions of rows).
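
Here is a simplified sketch of that loop (not our actual implementation; the topic name, batch size, and merge_into_warehouse helper are hypothetical), just to show the consume, dedupe, merge, then commit-offset ordering:

    from confluent_kafka import Consumer


    def merge_into_warehouse(batch: dict) -> None:
        # hypothetical helper: infer/evolve the schema, then MERGE the batch into
        # the destination table and wait for the warehouse to confirm success
        pass


    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "cdc-consumer",
        "enable.auto.commit": False,  # commit manually, only after a successful merge
    })
    consumer.subscribe(["cdc.public.orders"])  # one topic per table

    buffer = {}  # primary key -> latest change event for that row (deduplication)

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # the partition key is the row's primary key, so changes to a row stay ordered
        buffer[msg.key()] = msg.value()

        if len(buffer) >= 10_000:  # flush on batch size (or on a time interval)
            merge_into_warehouse(buffer)
            consumer.commit(asynchronous=False)  # offset only advances after the merge
            buffer.clear()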

However the data looks in your database, it should look exactly the same in your data warehouse. We also strive to handle all data types with no exceptions - e.g. TOAST columns, composite primary keys, arrays, large rows (>1 MB), etc. (I say “strive” because I’m sure we haven’t seen all possible data types yet!).

We’ve been live for a while now. Several companies use us to keep their analytics dashboards (built on top of their data warehouse) updated in real time. Fintech platforms use us for financial transaction monitoring. A utilities software platform uses us to ingest video/photo data for ML-based risk/hazard assessment.

Artie is live and supports PostgreSQL, MySQL, and MongoDB sources, and Snowflake, BigQuery, and Redshift destinations. We make money from our hosted service, where we charge based on usage (# of rows transferred per month, not including initial snapshots).

We’d love for you to try it out! You can get started with the open source version here: https://github.com/artie-labs/transfer. We have a small OSS Slack community (www.artie.so/slack) – feel free to ping us for help or any other requests.

For our hosted version, we need to ensure that we have enough storage and compute capacity provisioned, so we’re asking cloud users to hop on a quick call with us before we activate your account. Eventually we’ll have easy self-serve functionality but that’s not 100% built yet, so for now we set up a Slack channel to ensure smooth deployments. If you’re willing to work with us on that, we’ll be super excited to show you what we’ve got. Just email us at [email protected] or request access from https://www.artie.so.

We’d love for you to try the OSS or hosted solution and give us feedback! We’re eager to improve the product and test it against various workloads and data types :)


