The MLOps industry has matured rapidly for traditional ML (typically open-source models hosted in-house), but companies using LLMs are suffering from a lack of tooling to support things like experimentation, version control, and monitoring. They’re forced to build these tools themselves, taking valuable engineering time away from their core product.
There are 4 main pain points:
(1) Prompt engineering is tedious and time-consuming. People iterate on prompts in the playgrounds of individual model providers and store the results in spreadsheets or documents. Testing against many test cases is usually skipped because of how manual the process is.
(2) LLM calls against a large corpus of text aren't possible without semantic search. Because of limited context windows, any time an LLM has to return factual data from a set of documents, companies need to create embeddings, store them in a vector database, and host semantic search models to query for relevant results at runtime. Building this infrastructure is complex and time-consuming.
(3) There is limited observability/monitoring once LLMs are used in production. With no baseline for how something is performing, it's scary to make changes for fear of making things worse.
(4) Creating fine-tuned models and re-training them as new data becomes available is rarely done, despite the potential gains (higher quality, lower cost, lower latency, more defensibility). Companies don't usually have the capacity to build the infrastructure for collecting high-quality training data, or the automation pipelines used to re-train and evaluate new models.
We know these pain points from experience. Sidd and Noa are engineers who worked at Quora and DataRobot building ML tooling. Then the three of us worked together for a couple years at Dover (YC S19), where we built features powered by GPT-3 when it was still in beta. Our first production feature was a job description writer, followed by a personalized recruiting email generator and then a classifier for email responses.
We found it was easy enough to prototype, but taking features to production and improving them was a different story. It was a pain to keep track of what prompts we had tried and to monitor how they were performing under real user inputs. We wished we could version control our prompts, roll back, and even A/B test. We found ourselves investing in infrastructure that had nothing to do with our core features (e.g. semantic search). We ended up being scared to change prompts or try different models for fear of breaking existing behavior. As new LLM providers and foundation models were released, we wished we could compare them and use the best tool for the job, but didn’t have the time to evaluate them ourselves. And so on.
It’s clear that better tools are required for businesses to adopt LLMs at scale, and we realized we were in a good position to build them, so here we are! Vellum consists of 4 systems to address the pain points mentioned above:
(1) Playground—a UI for iterating on prompts side-by-side and validating them against multiple test cases at once. Prompt variants may differ in their text, underlying model, model parameters (e.g. “temperature”), and even LLM provider. Each run is saved as a history item and has a permanent URL that can be shared with teammates.
(2) Search—upload a corpus of text (e.g. your company help docs) in our UI (PDF/TXT) and Vellum will convert the text to embeddings and store it in a vector database to be used at runtime. When making an LLM call, we inject relevant context from your documents into the query and instruct the LLM to answer factually using only the provided context. This helps prevent hallucination and saves you from managing your own embeddings, vector store, and semantic search infra (a rough sketch of the DIY version this replaces appears below).
(3) Manage—a low-latency, high-reliability API wrapper that’s provider-agnostic across OpenAI, Cohere, and Anthropic (with more coming soon). Every request is captured and persisted in one place, providing full observability into what you’re sending these models, what they’re giving back, and how they’re performing. Prompts and model providers can be updated without code changes, historical requests can be replayed, and full version history is maintained. All of this serves as a data layer for metrics, monitoring, and soon, alerting (there’s a sketch of this kind of data layer below too).
(4) Optimize—the data collected in Manage is used to passively build up training data, which can be used to fine-tune your own proprietary models. With enough high-quality input/output pairs (a minimum of 100, though it depends on the use case), Vellum can produce fine-tuned models that offer better quality, lower cost, or lower latency. If a new model solves a problem better, it can be swapped in without code changes (also sketched below).
We also offer periodic evaluation against alternative models (e.g. we can check whether fine-tuning Curie produces results of comparable quality to Davinci, but at a lower price). Even though OpenAI is the dominant model provider today, we expect there to be many providers with strong foundation models, and in that case model interoperability will be key!
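To make the above more concrete, here's roughly what the do-it-yourself version of (2) looks like: chunk the documents, embed each chunk, find the chunks most similar to the query by cosine similarity, and inject them into the prompt. This is only a minimal sketch; embed() and complete() are placeholders for your model provider's embedding and completion endpoints (not a real SDK), and the chunk size and top-k value are arbitrary.

    # DIY retrieval sketch: chunk -> embed -> cosine-similarity search -> inject into prompt.
    # embed() and complete() are placeholders for a provider's embedding/completion calls.
    import numpy as np

    def embed(text):
        raise NotImplementedError("swap in your provider's embedding endpoint")

    def complete(prompt):
        raise NotImplementedError("swap in your provider's completion endpoint")

    def chunk(text, size=1000, overlap=100):
        # naive fixed-size chunking with a little overlap between neighbors
        return [text[i:i + size] for i in range(0, len(text), size - overlap)]

    def build_index(documents):
        chunks = [c for doc in documents for c in chunk(doc)]
        vectors = np.array([embed(c) for c in chunks])
        return chunks, vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    def answer(question, chunks, vectors, k=3):
        q = np.array(embed(question))
        q = q / np.linalg.norm(q)
        top = np.argsort(vectors @ q)[-k:]  # indices of the k most similar chunks
        context = "\n---\n".join(chunks[i] for i in top)
        prompt = ("Answer the question using ONLY the context below. "
                  "If the answer isn't there, say you don't know.\n\n"
                  f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
        return complete(prompt)

In production you'd swap the in-memory numpy matrix for a vector database and a hosted embedding model, which is exactly the infrastructure Search means you don't have to run yourself.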
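Similarly, here's the rough shape of the data layer in (3), sketched under the assumption that you log to a local JSONL file. The provider entries are stubs rather than real SDK calls, and the field names are ours for illustration, not Vellum's schema.

    # Provider-agnostic call wrapper with request capture, sketching what Manage does.
    # The provider lambdas are stubs; in real code they'd wrap each vendor's SDK.
    import json, time, uuid

    PROVIDERS = {
        "openai": lambda prompt, **params: "...completion from OpenAI...",
        "anthropic": lambda prompt, **params: "...completion from Anthropic...",
    }

    def generate(prompt, provider="openai", log_path="llm_requests.jsonl", **params):
        start = time.time()
        completion = PROVIDERS[provider](prompt, **params)
        record = {
            "id": str(uuid.uuid4()),
            "provider": provider,
            "params": params,
            "prompt": prompt,
            "completion": completion,
            "latency_ms": round((time.time() - start) * 1000, 1),
        }
        with open(log_path, "a") as f:  # every request persisted in one place
            f.write(json.dumps(record) + "\n")
        return completion

Because callers only ever go through generate(), switching providers is a one-argument change, and the resulting log gives you the baseline that makes prompt changes in production less scary.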
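Finally, (4) boils down to turning that captured traffic into training data. A minimal sketch, assuming the llm_requests.jsonl log above plus a hypothetical "approved" flag you'd attach from user feedback; the prompt/completion JSONL layout is a common shape for fine-tuning APIs, though your provider's exact format may differ.

    # Turning captured traffic into a fine-tuning dataset, roughly what Optimize automates.
    # Assumes the llm_requests.jsonl log from the sketch above plus a hypothetical
    # "approved" flag marking outputs that users accepted.
    import json

    def build_training_set(log_path="llm_requests.jsonl", out_path="train.jsonl", minimum=100):
        with open(log_path) as f:
            rows = [json.loads(line) for line in f]
        good = [r for r in rows if r.get("approved")]  # keep only vetted input/output pairs
        if len(good) < minimum:
            raise ValueError(f"only {len(good)} approved examples; need at least {minimum}")
        with open(out_path, "w") as f:
            for r in good:
                f.write(json.dumps({"prompt": r["prompt"], "completion": r["completion"]}) + "\n")
        return out_path

From there, re-training and re-evaluating a new model on a schedule is the automation piece most teams never get around to building.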
Here’s a video demo showcasing Vellum (feel free to watch on 1.5x!): https://www.loom.com/share/5dbdb8ae87bb4a419ade05d92993e5a0.
We currently charge a flat monthly platform fee that varies based on the quantity and complexity of your use cases. In the future, we plan on having more transparent pricing that’s made up of a fixed platform fee + some usage-based component (e.g. number of tokens used or requests made).
If you look at our website you’ll notice the dreaded “Request early access” rather than “Try now”. That’s because the LLM Ops space is evolving extremely quickly right now. To maximize our learning rate, we need to work intensively with a few early customers to help get their AI use cases into production. We’ll invite self-serve signups once that core feature set has stabilized a bit more. In the meantime, if you’re interested in being one of our early customers, we’d love to hear from you and you can request early access here: https://www.vellum.ai/landing-pages/hacker-news.
We deeply value the expertise of the HN community! We’d love to hear your comments and get your perspective on our overall direction, the problems we’re aiming to solve, our solution so far, and anything we may be missing. We hope this post and our demo video provide enough material to start a good conversation and we look forward to your thoughts, questions, and feedback!