Talc AI (YC S23) – Test Sets for AI

Hey all! Max and Matt here from Talc AI. We do automated QA for anything built on top of an LLM. Check out our demo: https://talc.ai/demo

We’ve found that it's very difficult to know how well LLM applications (and especially RAG systems) are going to work in the wild. Many companies tackle this by having developers or contractors run tests manually. It’s a slow process that holds back development, and it still lets unexpected behavior slip through when the application ships.

We’ve dealt with similar problems before: Max was a staff engineer working on systematic technical solutions for privacy problems at Facebook, and Matt worked on ML ops for Facebook’s election integrity team, helping run classifiers that handled trillions of data points. We learned that even the best predictive systems need to be deeply understood and trusted to be useful to product teams, and we set out to build the same understanding in AI.

To solve this, we take ideas from academia on how to benchmark the general capabilities of language models and apply them to generating domain-specific test cases that run against your actual prompts and code.

Consider an analogy: if you’re a lawyer, we don’t need to be lawyers to open up a legal textbook and test your knowledge of its contents. Similarly, if you’re building a legal AI application, we don’t need to build your application to come up with an effective set of tests that benchmark its performance.

To make this more concrete: when you pick a topic in the demo, we grab the associated Wikipedia page and extract a set of facts from it using a classic NLP technique called “named entity recognition”. For example, if you picked FreeBASIC, we might extract the following line:

    Source of truth: "IDEs specifically made for FreeBASIC include FBide and FbEdit,[5] while more graphical options include WinFBE Suite and VisualFBEditor." 
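
To give a flavor of that extraction step, here’s a minimal sketch using spaCy’s off-the-shelf English pipeline. This is illustrative only; our production extraction is more involved than a single library call:

    # Minimal NER sketch (illustrative, not our production pipeline).
    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")

    sentence = (
        "IDEs specifically made for FreeBASIC include FBide and FbEdit, "
        "while more graphical options include WinFBE Suite and VisualFBEditor."
    )

    # Each named entity spaCy finds is a candidate anchor for a test question.
    for ent in nlp(sentence).ents:
        print(ent.text, ent.label_)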

The extracted sentence is our source of truth. We then use an LLM to work backwards from this fact into a question and answer:

    Question: "What programming language are the IDEs WinFBE Suite and FbEdit designed to support?"
    Reference Answer: "FreeBasic"
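
In code, that backwards step is little more than a prompt. Here’s a minimal sketch assuming OpenAI’s chat completions API; the prompt wording and model choice are placeholders, not our actual setup:

    # Sketch: turn an extracted fact into a question/answer pair with an LLM.
    # Assumes OPENAI_API_KEY is set in the environment.
    from openai import OpenAI

    client = OpenAI()

    fact = (
        "IDEs specifically made for FreeBASIC include FBide and FbEdit, "
        "while more graphical options include WinFBE Suite and VisualFBEditor."
    )

    prompt = (
        "Write one question that is answered by the fact below, then the "
        "answer on its own line, prefixed with 'Answer:'.\n\n"
        f"Fact: {fact}"
    )

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)

Because the question is generated from the fact rather than the other way around, the reference answer is grounded in the source of truth by construction.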

We can then grade accurately by comparing the application’s answer against the reference answer and the original source of truth; this is how we generate the “simple” questions in the demo.
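
Here’s a toy version of that grading comparison, reusing the client from the sketch above; our actual grader mixes LLM judgment with traditional algorithms, as noted below:

    # Sketch: judge the application's answer against the reference answer
    # and the source-of-truth sentence. Illustrative only.
    def grade(question: str, app_answer: str, reference: str, source: str) -> bool:
        prompt = (
            "You are grading a question-answering system.\n"
            f"Question: {question}\n"
            f"System answer: {app_answer}\n"
            f"Reference answer: {reference}\n"
            f"Source of truth: {source}\n"
            "Does the system answer agree with the reference answer and the "
            "source of truth? Reply YES or NO."
        )
        result = client.chat.completions.create(
            model="gpt-4",  # placeholder model, as above
            messages=[{"role": "user", "content": prompt}],
        )
        return result.choices[0].message.content.strip().upper().startswith("YES")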

In production we’re building this same functionality on our customers’ knowledge bases instead of Wikipedia. We then employ a few different strategies to generate questions, ranging from simple factual ones like “How much does the 2024 Chevy Tahoe cost?” to complex ones like “What would a mechanic have to do to fix the recall on my 2018 Golf?” These questions are based on facts extracted from your knowledge base and on real customer examples.

This testing and grading process is fast: it’s driven by a mixture of LLMs and traditional algorithms, and can turn results around in minutes. Our business model is pretty simple: we charge for each test created. If you opt to use our grading product as well, we charge for each example graded against the test.

We’re excited to hear what the HN community thinks – please let us know in the comments if you have any feedback, questions or concerns!


