GradientJ (YC W23) – Build NLP Applications Faster with LLMs

Hey HN, we’re Daniel and Oscar, founders of GradientJ (https://gradientj.com), a web application that helps teams develop, test, and monitor natural language processing (NLP) applications using large language models (LLMs).

Before GradientJ, we’d been building NLP applications for 4 years, using transformer models like BERT. With the advent of LLMs and their zero-shot/few-shot capabilities, we saw the NLP dev cycle get flipped on its head. Rather than having to hire an army of data labelers and data scientists to fine-tune a BERT model for your use case, engineers can now use LLMs, like GPT-4, to build NLP endpoints in minutes.

As powerful as this is, the problem is that without appropriate tools for version control, regression testing, and ongoing maintenance like monitoring and A/B testing, managing these models is a pain. Because the data being evaluated is often fuzzy, developers either have to build complex regex-based text-processing pipelines or manually evaluate each output before a new release. Moreover, if your prompts are only maintained in a Notion doc or Google Sheet, completely separate from these tests, it's difficult to identify which changes led to underperformance. The workflow often devolves into manual, subjective human data labeling just to decide whether new versions of your model are "good enough" to deploy.

GradientJ is a web application and API to address that. We let you iterate on prompts, automatically regression test them along multiple dimensions, and finally manage them once deployed.

You’d think these are pretty straightforward things to build, but we’ve noticed most versions of “LLM management apps” focus on organizing the workflow for these components without dramatically improving on automating them. At the end of the day, you still have to pass your side-by-side prompt comparison through the “eyeball test”, which creates processes bottlenecked by human time. We think that by using the very same technology, NLP, you can dramatically reduce the developer labor required for each of these steps.

Here’s how we do it:

For prompt iteration, rather than just a text-editor “playground” with some special syntax to delineate variables, we’re trying to use large language models to create a Copilot-like experience for prompt engineering. This means aggregating all the tricks of prompt engineering behind a smart LLM assistant who can suggest ways to restructure your prompt for better output. For example, when someone just wants their output in JSON form, we know where to inject the appropriate text to nudge the model towards generating JSON. When combined with our regression testing API, those prompt suggestions will actually be based on the specific dimensions of prompt underperformance. The idea is that the changes required to make a prompt’s output follow a certain structure are different from the ones you’d make to have the output follow a certain tone.
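To make the JSON example concrete, here's a minimal sketch of the kind of rewrite such an assistant might apply. This is our illustration, not GradientJ's actual code; the function name and the exact instruction wording are assumptions.

```python
# Minimal sketch: nudge a prompt toward well-formed JSON output by appending an
# explicit instruction and the desired keys. Illustrative only; GradientJ's
# suggestions are generated by an LLM rather than a fixed template like this.

def suggest_json_nudge(prompt: str, keys: list[str]) -> str:
    """Return a revised prompt that asks the model for JSON with the given keys."""
    schema_hint = ", ".join(f'"{k}"' for k in keys)
    return (
        f"{prompt.rstrip()}\n\n"
        "Respond with a single valid JSON object and nothing else. "
        f"It must contain exactly these keys: {schema_hint}."
    )

if __name__ == "__main__":
    base = "Extract the sender and the requested meeting time from the email below:\n{email}"
    print(suggest_json_nudge(base, ["sender", "meeting_time"]))
```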

When it comes to testing, even before LLMs, configuring high-quality tests for expressive NLP models has historically been hard. To compare anything more complicated than classification labels, most people resort to raw fuzzy string comparisons, or token distribution differences between outputs. We’re trying to make automated NLP testing more objective by using LLMs to actually power our regression testing API. We use NLP models to provide comparisons between text outputs along custom dimensions like “structure”, “semantics”, and “tone”. This means before you deploy the latest version of your email generation model, you know where it stands along each of the discrete dimensions you care about. Additionally, this helps prevent your prompt engineering from becoming a game of “whack-a-mole”: overfitting your prompt on the handful of examples you can copy and paste while developing.
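For readers unfamiliar with the general technique (an LLM acting as a judge along one dimension at a time), here's a minimal sketch assuming the OpenAI Python client. The rubric wording and score scale are our assumptions, not GradientJ's internal scoring code.

```python
# Minimal sketch of LLM-powered regression comparison along a single named
# dimension (e.g. "tone"). Illustration of the general technique only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def compare_outputs(reference: str, candidate: str, dimension: str) -> str:
    """Ask an LLM to grade how closely `candidate` matches `reference` on one dimension."""
    rubric = (
        f"You are grading two model outputs on the dimension '{dimension}' only.\n"
        f"Reference output:\n{reference}\n\n"
        f"Candidate output:\n{candidate}\n\n"
        "Reply with a score from 1 (very different) to 5 (equivalent) "
        "and one sentence of justification."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": rubric}],
        temperature=0,
    )
    return response.choices[0].message.content

# Example: check a new email-generation prompt against a known-good output on "tone".
# print(compare_outputs(old_email, new_email, "tone"))
```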

For deployment, we provide a stable API that always goes to the latest iteration of a prompt you’ve chosen to deploy. This means you can push updates over-the-air without having to change the API code. At the same time, we’re tracking the versions used for inference under the hood. This lets you use that data to further improve your regression tests, experiment with fine-tuning across other providers or open source models, or set up alerts around prompt performance.
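To show what "stable API, swappable prompt version" means in practice, here's a hypothetical calling pattern: the application code references one fixed URL, and the prompt version behind it is changed server-side. The URL, payload shape, and response field below are invented for this sketch and are not GradientJ's documented API.

```python
# Hypothetical client for a stable deployed-prompt endpoint. Swapping the prompt
# version happens on the server, so this application code never changes.
import requests

ENDPOINT = "https://api.example.com/v1/prompts/email-summarizer/infer"  # hypothetical URL

def run_prompt(variables: dict, api_key: str) -> str:
    resp = requests.post(
        ENDPOINT,
        json={"variables": variables},           # assumed payload shape
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["output"]                 # assumed response field
```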

Each of these pieces of our product can be used in isolation or all together, depending on what the rest of your NLP infrastructure looks like.

If you use LLMs and are looking for ways to improve your workflow, or if you need to build NLP applications fast and want to bypass the traditional slow data labeling process, we’d love your feedback!


