Dioptra (YC W22) – Improve ML models by improving their training data

Hi HN! We're Pierre, Jacques, and Farah from Dioptra (https://dioptra.ai). Dioptra tracks ML metrics to identify model error patterns and suggest the best data curation strategy to fix them.

We’ve seen a paradigm shift in ML in recent years: the “code” has become a commodity, and many powerful ML models are open source today. The real challenge is to grow and curate quality data. This raises the need for new data-centric tools: IDEs, debuggers, monitoring. Dioptra is a data-centric tool that helps debug models and fix them by systematically curating and growing the best data, at scale.

We experienced this problem firsthand while deploying and retraining models. Once a model was in production, maintenance was a huge pain. First, it was hard to assess model performance. Accessing the right production data to diagnose issues was complicated. We had to build custom scripts to connect to DBs, download production data (Compliance, look the other way!) and analyze it.

Second, it was hard to translate the diagnosis into concrete next steps, that is, finding the best data to fix and retrain our models. It required another set of scripts to sample new data, label it and retrain. With a large enough labeling budget, we were able to improve our models, but it wasn’t optimal: labeling is expensive, and random data sampling doesn’t yield the best results. And since the process relied on our individual domain expertise (aka gut feelings), it was inconsistent from one data scientist to the next and not scalable.

We talked to a couple hundred ML practitioners who helped us validate and refine our thinking (we thank every single one of them!). For example, one NLP team had to read more than 10 long legal contracts per week per person. The goal was to track any model errors. Once a month, they compiled their notes into an Excel sheet to detect patterns of errors. Once detected, they had to read more contracts to build their retraining dataset! There were multiple issues with that process. First, the assessment of errors was subjective since it depended on individual interpretations of the legal language. Second, the sourcing of retraining data was time-consuming and anecdotal. Finally, they had to spend a lot of time coaching new members to minimize subjectivity.

Processes like this highlight how model improvement needs to be less anecdotal and more systematic. A related problem is lack of tooling, which puts a huge strain on ML teams that are constantly asked to innovate and take on new projects.

Dioptra computes a comprehensive set of metrics to give ML teams a full view of their model and detect failure modes. Teams can objectively prioritize their efforts based on the impact of each error pattern. They can also slice and dice to root-cause errors, zero in on faulty data, and visualize it. What used to take days of reading can now be done in a couple of hours. Teams can then quality-check and curate the best data for retraining using our embedding similarity search or active learning techniques. They can easily understand, customize and systematically engineer their data curation strategy with our automation APIs in order to get the best model at each iteration and stay on top of the latest production patterns. Additionally, Dioptra fits within any ML stack. We have native integrations with major deep learning frameworks.
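
To make the embedding similarity search idea concrete, here is a minimal sketch in plain Python. It is not Dioptra's API or implementation: the function names, the toy three-dimensional embeddings, and the document ids are all hypothetical, and a real system would likely use a vector index rather than a linear scan. It only illustrates the general technique of taking one failing example and pulling the most similar unlabeled items as retraining candidates.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, pool, k=2):
    """Return the ids of the k pool items whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query, emb), item_id) for item_id, emb in pool.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical data: the embedding of one example the model got wrong,
# and a small pool of unlabeled production items.
failing_example = [0.9, 0.1, 0.0]
unlabeled_pool = {
    "doc_a": [0.8, 0.2, 0.1],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 0.0, 0.0],
    "doc_d": [0.1, 0.1, 0.9],
}

candidates = most_similar(failing_example, unlabeled_pool, k=2)
print(candidates)
```

The candidates returned this way would then go to labeling, so the retraining set targets the observed failure mode instead of a random sample.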

Some of our customers reduced their data ops costs by 30%. Others improved their model accuracy by 20% in one retraining cycle thanks to Dioptra.

Active learning, which has been around for a while but was little known outside a few teams until recently, makes intentional retraining possible. This approach has been validated by ML organizations like Tesla, Cruise and Waymo. Recently, other companies like Pinterest started building similar infrastructure. However, it is costly to build and requires specialized skills. We want to make it accessible to everybody.
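
For readers new to active learning, one common flavor is uncertainty sampling: spend the labeling budget on the samples the model is least sure about. The sketch below is an illustration of that general technique under our own assumptions, not a description of how Dioptra or any of the companies above implement it. The sample ids, class probabilities, and function names are made up for the example.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher means less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` sample ids with the highest predictive entropy."""
    ranked = sorted(
        predictions,
        key=lambda sample_id: entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs on unlabeled production data:
# one probability per class for each sample.
predictions = {
    "sample_1": [0.98, 0.01, 0.01],  # confident prediction
    "sample_2": [0.40, 0.35, 0.25],  # uncertain
    "sample_3": [0.70, 0.20, 0.10],  # moderately confident
    "sample_4": [0.34, 0.33, 0.33],  # close to uniform
}

to_label = select_for_labeling(predictions, budget=2)
print(to_label)
```

Compared with random sampling, this spends the same labeling budget on the examples most likely to change the model, which is the intuition behind intentional retraining.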

We created an interactive demo for HN: https://capture.navattic.com/cl4hciffr2881909mv2qrlsc9g

Please share any feedback and thoughts. Thanks for reading!


