ML models are defined by a combination of code and the data that the code trains on. A programmer must think hard about what behavior they want from their model, assemble a dataset of labeled examples of what they want their model to do, and then train their model on that dataset. As they encounter errors in production, they must collect and label data for the model to train on to fix these errors, and verify they're fixed by monitoring the model’s performance on a test set with previous failure cases. See Andrej Karpathy’s Software 2.0 article (https://medium.com/@karpathy/software-2-0-a64152b37c35) for a great description of this workflow.
My cofounder Quinn and I were early engineers at Cruise Automation (YC W14), where we built the perception stack and ML infrastructure for self-driving cars. Quinn was tech lead of the ML infrastructure team and I was tech lead for the Perception team. We frequently ran into problems with our datasets that we needed to fix, and we found that most model improvement came from improvements to a dataset’s variety and quality. Basically, ML models are only as good as the datasets they’re trained on.
ML datasets need variety so the model can train on the types of data that it will see in production environments. In one case, a safety driver noticed that our car was not detecting green construction cones. Why? When we looked into our dataset, it turned out that almost all of the cones we had labeled were orange. Our model had not seen many examples of green cones at training time, so it was performing quite badly on this object in production. We found and labeled more green cones into our training dataset, retrained the model, and it detected green cones just fine.
ML datasets need clean and consistent data so the model does not learn the wrong behavior. In another case, we retrained our model on a new batch of data that came from our labelers and it was performing much worse on detecting “slow signs” in our test dataset. After days of careful investigation, we realized it was due to a change to our labeling process that caused our labelers to label many “speed limit signs” as “slow signs,” which was confusing the model and causing it to perform badly on detecting “slow signs.” We fixed our labeling process, did an additional QA pass over our dataset to fix the bad labels, retrained our model on the clean data, and the problems went away.
While there’s a lot of tooling out there to debug and improve code, there’s not a lot of tooling to debug and improve datasets. As a result, it’s extremely painful to identify issues with variety and quality and appropriately modify datasets to fix them. ML engineers often encounter scenarios like:
Your model’s accuracy measured on the test set is at 80%. You abstractly understand that the model is failing on the remaining 20%, but you have no idea why.
Your model does great on your test set but performs disastrously when you deploy it to production and you have no idea why.
You retrain your model on some new data that came in, it’s worse, and you have no idea why.
ML teams want to understand what’s in their datasets, find problems in their dataset and model performance, and then edit / sample data to fix these problems. Most teams end up building their own one-off tooling in-house that isn’t very good. This tooling typically relies on naive methods of data curation that are really manual and involve “eyeballing” many examples in your dataset to discover labeling errors / failure patterns. This works well for small datasets but starts to fail as your dataset size grows above a few thousand examples.
Aquarium’s technology relies on letting your trained ML model do the work of guiding what parts of the dataset to pay attention to. Users can get started by submitting their labels and corresponding model predictions through our API. Then Aquarium lets users drill into their model performance - for example, visualize all examples from a given date range where the model confused a labeled car for a pedestrian - so users can understand the different failure modes of a model. Aquarium also finds examples where your model has the highest loss / disagreement with your labeled dataset, which tends to surface many labeling errors (i.e., the model is right and the label is wrong!).
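To make the "highest loss / disagreement" idea concrete, here is a minimal sketch in Python with NumPy. It is not Aquarium's API or implementation: the function names, the toy data, and the choice of cross-entropy as the loss are all assumptions for illustration. It ranks examples by how strongly the model disagrees with the label, and pulls out one specific confusion (labeled car, predicted pedestrian).

```python
import numpy as np

def rank_by_loss(probs, labels):
    """Return (indices sorted from highest to lowest loss, per-example loss).

    Assumes `probs` holds per-class probabilities from a classifier and
    `labels` holds integer class ids. Cross-entropy is an assumed choice.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Probability the model assigned to the labeled class of each example.
    p_label = probs[np.arange(len(labels)), labels]
    loss = -np.log(np.clip(p_label, 1e-12, 1.0))
    order = np.argsort(-loss, kind="stable")
    return order, loss

def confusions(probs, labels, labeled_as, predicted_as):
    """Indices of examples labeled `labeled_as` but predicted `predicted_as`."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = probs.argmax(axis=1)
    return np.flatnonzero((labels == labeled_as) & (preds == predicted_as))

# Hypothetical toy data: class 0 = car, class 1 = pedestrian.
probs = [[0.9, 0.1], [0.2, 0.8], [0.05, 0.95], [0.6, 0.4]]
labels = [0, 1, 0, 1]

order, loss = rank_by_loss(probs, labels)
car_as_ped = confusions(probs, labels, labeled_as=0, predicted_as=1)
```

In this toy data, example 2 is labeled as a car but the model is very confident it is a pedestrian, so it lands at the top of the review queue. Whether the label or the model is wrong still takes a human look.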
Users can also provide their model's embeddings for each entry, which are an anonymized representation of what their model “thought” about the data. The neural network embeddings for a datapoint (generated by either our users’ neural networks or by our stable of pretrained nets) encode the input data into a relatively short vector of floats. We can then identify outliers and group together examples in a dataset by analyzing the distances between these embeddings. We also provide a nice thousand-foot-view visualization of embeddings that allows users to zoom into interesting parts of their dataset. (https://youtu.be/DHABgXXe-Fs?t=139)
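As a rough illustration of finding outliers from embedding distances, here is a small Python sketch. The scoring rule (mean Euclidean distance to the k nearest neighbors), the function name, and the toy vectors are assumptions picked for clarity, not a description of what Aquarium actually runs.

```python
import numpy as np

def knn_outlier_scores(embeddings, k=2):
    """Score each row by its mean Euclidean distance to its k nearest neighbors.

    Higher scores mean the example sits farther from the rest of the dataset.
    """
    x = np.asarray(embeddings, dtype=float)
    # Full pairwise distance matrix: simple, but O(n^2) memory, so this is
    # only suitable for small datasets.
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Hypothetical 2-D embeddings: four clustered points and one far away.
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
scores = knn_outlier_scores(embeddings, k=2)
most_unusual = int(np.argmax(scores))
```

Real embeddings have hundreds of dimensions and datasets have far more rows, so in practice you would likely swap the brute-force distance matrix for an approximate nearest-neighbor index.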
Because embeddings can be extracted from most neural networks, our platform is very general. We have successfully analyzed datasets and models operating on images, 3D point clouds from depth sensors, and audio.
After finding problems, Aquarium helps users solve them by editing or adding data. When the problem is bad data, Aquarium integrates into our users’ labeling platforms to automatically correct labeling errors. When the problem is a pattern of model failures, Aquarium samples similar examples from users’ unlabeled datasets (for example, more green cones) and sends those to labeling.
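One simple way to picture "sample similar examples from unlabeled data" is a nearest-neighbor lookup in embedding space. The Python sketch below is an assumption-laden illustration: cosine similarity, the function name `most_similar`, and the toy vectors are all hypothetical, and it assumes no embedding is the zero vector.

```python
import numpy as np

def most_similar(query_embeddings, pool_embeddings, top_n=2):
    """Return indices of the pool rows most similar to any query row.

    Similarity is cosine similarity (an assumed choice). Rows must be
    non-zero vectors, since each one is divided by its norm.
    """
    q = np.asarray(query_embeddings, dtype=float)
    p = np.asarray(pool_embeddings, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    sims = p @ q.T            # shape: (pool size, number of queries)
    best = sims.max(axis=1)   # best match of each pool row against any query
    return np.argsort(-best, kind="stable")[:top_n]

# Hypothetical embeddings: one known failure case (say, a green cone)
# and four unlabeled candidates.
failures = [[1.0, 0.0]]
unlabeled = [[0.0, 1.0], [0.9, 0.1], [1.0, 0.05], [-1.0, 0.0]]
to_label = most_similar(failures, unlabeled, top_n=2)
```

The returned indices are the unlabeled examples you would send to labeling first, since they look most like the cases the model is already getting wrong.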
Think about this as a platform for interactive learning. By focusing on the most “important” areas of the dataset that the model is consistently getting wrong, we increase the leverage of ML teams to sift through massive datasets and decide on the proper corrective action to improve their model performance.
Our goal is to build tools to reduce or eliminate the need for ML engineers to handhold the process of improving model performance through data curation - basically, Andrej Karpathy’s Operation Vacation concept (https://youtu.be/g2R2T631x7k?t=820) as a service.
If any of those experiences speak to you, we’d love to hear your thoughts and feedback. We’ll be here to answer any questions you might have!