We're Allison and Steve of Syndetic (https://www.getsyndetic.com). Syndetic is a web app that data providers use to explain their datasets to their customers. Think ReadMe but for datasets instead of APIs.
Every exchange of data ultimately comes down to a person at one company explaining their data to a person at another. Data buyers need to understand what's in the dataset (what are the fields and what do they mean) as well as how valuable it can be to them (how complete is it? how relevant?). Data providers solve this problem today with a "data dictionary" which is a meta spreadsheet explaining a dataset. This gets shared alongside some sample data over email. These artifacts are constantly getting stale as the underlying data changes.
Syndetic replaces this with software connected directly to the data that's being exchanged. We scan the data and automatically summarize it through statistics (e.g., cardinality), coverage rates, frequency counts, and sample sets. We do this continuously to monitor data quality over time. If a field gets removed from the file or goes from 1% null to 20% null we automatically alert the provider so they can take a look. For an example of what we produce but on an open dataset check out the results of the NYC 2015 Tree census at https://www.getsyndetic.com/publish/datasets/f1691c5d-56a9-4....
We met at SevenFifty, a tech startup connecting the three tiers of the beverage alcohol trade in the United States. SevenFifty integrates with the backend systems of 1,000+ beverage wholesalers to produce a complete dataset of what a restaurant can buy wholesale, at what price, in any zipcode in America. While the core business is a marketplace between buyers and sellers of alcohol, we built a side product providing data feeds back to beverage wholesalers about their own data. Syndetic grew out of the problems we experienced doing that. Allison kept a spreadsheet in dropbox of our data schema, which was very difficult to maintain, especially across a distributed team of data engineers and account managers. We pulled sample sets ad hoc, and ran stats over the samples to make sure the quality was good. We spent hours on the phone with our customers putting it all together to convey the meaning and the value of our data. We wondered why there was no software out there specifically built for data-as-a-service.
We also have backgrounds in quantitative finance (D. E. Shaw, Tower Research, BlackRock), large purchasers of external data, where we've seen the other side of this problem. Data purchasers spend a lot of time up-front evaluating the quality of a dataset, but they often don’t monitor how the quality changes over time. They also have a hard time assessing the intersection of external datasets with data they already have. We're focusing on data providers first but expect to expand to purchasers down the road.
Our tech stack is one monolithic repo split into the frontend web app and backend data scanning. The frontend is a rails app and the data scanning is written in rust (we forked the amazing library xsv). One quirk is that we want to run the scanning in the same region as our customers' data to keep bandwidth costs and transfer time down, so we're actually running across both GCP and AWS.
If you're interested in this field you might enjoy reading the paper "Datasheets for datasets" (https://arxiv.org/pdf/1803.09010.pdf) which proposes a standardized method for documenting datasets modeled after the spec sheets that come with electronics. The authors propose that “for dataset creators, the primary objective is to encourage careful reflection on the process of creating, distributing, and maintaining a dataset, including any underlying assumptions, potential risks or harms, and implications of use.” We agree with them that as more and more data is sold, the chance of misunderstanding what’s in the data increases. We think we can help here by building qualitative questions into Syndetic alongside automation.
We have lots of ideas of where we could go with this, like fancier type detection (e.g. is this a phone number), validations, visualizations, anomaly detection, stability scores, configurable sampling, and benchmarking. We'd love feedback and to hear about your challenges working with datasets!