Christos, Damien and Nodar here. We're the co-founders of Synth (https://getsynth.com) - Synth is an API that lets you quickly and easily provision test databases filled with realistic data for testing your application.
We started our company about a year ago, after working at a quantitative hedge fund in London where we built models to trade US equities. Strangely, instead of spending time developing models or building the trading system, a large portion of our time was spent just sourcing and onboarding datasets to train and feed our models. The process of testing datasets and onboarding them was archaic; one data provider served us XML files over FTP, which we then had to spend weeks transforming for our models to ingest. A different provider asked us to spin up our own database and then sent us a binary which was used to load the data. We had to whitelist their API's IP address and set up a cron job to make sure the dataset was never out of date. The binary required interactive input, so it couldn't be scripted, or rather it could be, but you'd need something to mock the interactive params. All this took a junior developer on the team a good 3-4 days to figure out and set up. Furthermore, after our trial expired we decided we didn't actually need this dataset, so those 3-4 days were essentially wasted. Our frustration with the status quo in data distribution is what drove us to start our company.
We spent the first 6 months building a privacy-aware query engine (think Presto but with built-in privacy primitives), but the software developers we talked to would frequently steer the conversation to the lack of high-quality, sanitised testing data during the software development lifecycle. It was strange - most of us developers and data scientists constantly use some sort of testing data for different reasons. Maybe you want a local development environment which is representative of production but free of customer data. Or a staging environment which contains a much smaller, representative database so that tests run faster. You could want the dataset to be much bigger to test how your application scales. Maybe you want to share your database with third-party contractors who you don't necessarily trust. Whichever way you put it, it's strange that for a problem most of us face every day, we have no idiomatic solution. We write bespoke scripts and pipelines which often break. They are time-consuming to write and maintain, and every time your schema changes you need to update them manually. Or we get lazy and copy/paste production.
We finally listened to all this feedback, dropped the previous product, and built Synth instead. Synth is a platform for provisioning databases with completely synthetic data.
The way Synth works can be broken into 3 main steps. You first download our CLI tool (a bunch of Python wrapped up in a container) and point it at your database to create a model (we host the models on the Synth platform). This model encodes your schema and foreign key relationships, as well as a semantic representation of your types. We currently use simple regular expressions to classify the semantic types (for example an address or license plate). The whole model is represented as a JSON object - if the classifier gets something wrong you can easily change the semantic type. Once the model has been created, the next step is to train it. Under the hood we use a combination of copulas and deep-learning models to capture the distributions and correlations in your dataset (the intuition here is that it's much more useful for developers to have realistic data than to just sample from a random number generator). The final step is to use the trained model to generate synthetic data. You can either sample directly from the model or we can spin up a database for you and fill it with as much data as you need. The generation step samples from the trained model to create realistic data, and uses bespoke generators for sensitive fields (credit card numbers, names, addresses etc.).
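To make the first step a bit more concrete, here is a rough sketch in Python of what regex-based semantic typing with a JSON model could look like. This is not Synth's actual code: the patterns, the `classify_column` and `build_model` helpers, the 0.9 match threshold and the model layout are all assumptions made up for illustration.

```python
import json
import re

# Hypothetical patterns for illustration only; the real classifier's rules
# are not described beyond "simple regular expressions".
SEMANTIC_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "credit_card": re.compile(r"\d{4}([ -]?\d{4}){3}"),
    "uk_license_plate": re.compile(r"[A-Z]{2}\d{2} ?[A-Z]{3}"),
}


def classify_column(values, threshold=0.9):
    """Return the first semantic type whose pattern fully matches at least
    `threshold` of the non-null values, or "unknown" if none does."""
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return "unknown"
    for name, pattern in SEMANTIC_PATTERNS.items():
        hits = sum(1 for v in non_null if pattern.fullmatch(v))
        if hits / len(non_null) >= threshold:
            return name
    return "unknown"


def build_model(table):
    """Map each column name to its guessed semantic type (assumed layout)."""
    return {
        column: {"semantic_type": classify_column(values)}
        for column, values in table.items()
    }


# A tiny made-up table: column name -> sample values.
sample_table = {
    "contact": ["ada@example.com", "bob@example.org", None],
    "card": ["4111 1111 1111 1111", "5500-0000-0000-0004"],
    "plate": ["AB12 CDE", "XY68ZZZ"],
    "notes": ["called twice", "n/a"],
}

model = build_model(sample_table)

# The model is plain JSON-style data, so a wrong or missing guess
# can be corrected by hand before training.
model["notes"]["semantic_type"] = "free_text"

model_json = json.dumps(model, indent=2)
print(model_json)
```

Keeping the model as plain data is what makes the manual override cheap: fixing a misclassified column is a one-line edit to the JSON rather than a change to the classifier.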
You can run the entire lifecycle in a single command - you point the CLI tool at your database (currently Postgres, MySQL and MSSQL) and in ~1 minute you get an IP address and credentials for your new database with completely synthetic data.
We're long-time fans of HN and are eagerly looking forward to feedback from the community (especially criticism). We've made a free version available for this week so you can try it with no strings attached. We hope some of you will find Synth useful. If you have any questions, we'll be around throughout the day. Also feel free to get in touch via the site.
Thanks! ~ Christos, Damien & Nodar