Deepgram (YC W16) – Scalable Speech API for Businesses

Hey HN,

I’m Scott Stephenson, one of the cofounders of Deepgram (https://www.deepgram.com/). Getting information out of recorded phone calls and meetings is time-intensive, costly, and imprecise. Our speech recognition API lets businesses reliably turn high-value unstructured audio into accurate, parsable data.

Deepgram started when my cofounder Noah Shutty and I had just finished looking for dark matter (in a particle physics lab at the University of Michigan). Noah had the idea to start recording all the audio from his life, 24/7. After gathering hundreds of hours of recordings, we wanted to search inside this fresh dataset, but realized there wasn’t a good way to find specific moments. So we built a tool using the same AI techniques we had used to find dark matter particle events, and it ended up working pretty well. A few months later, we made a single-page demo to show off “searching through sound” and posted it to HN. Pretty soon we were in the Winter 2016 batch of YC (https://techcrunch.com/2016/09/27/launching-a-google-for-sou...).

I’d say we didn’t know what we were getting ourselves into. Speech is a really big problem with a huge market, but it’s also a tough nut to crack. For decades, companies have been unable to get real learnings from their massive amounts of recorded audio (some companies record more than 1,000,000 minutes of call center calls every single day). They record that audio for a few reasons — some for compliance, some for training, and some for market research. The questions they’re trying to answer are usually as simple as:

  - “What is the topic of the call?” 
  - “Is this call compliant?” (did I say my company name, my name, and “this call may be recorded”?)
  - “Are people getting their problems solved quickly?” 
  - “Do my agents need training?” 
  - “What are our customers talking about? Competitors? Our latest marketing campaign?”

It’s the most intimate view you can get of your customers, but the problem is so large and difficult that companies have pushed it into the corner for the past couple of decades, doing just enough to stop the bleeding. Current tools only transcribe with around 50-60% accuracy on real-world, noisy, accented, industry-specific audio (don’t believe the ‘human level accuracy’ hype). When companies start solving problems with speech data, they first want transcription that’s accurate. After accuracy comes scale — another big problem. Speech processing is computationally expensive and slow. Imagine trying to get into an iterative problem-solving loop when you have to wait 24 hours to get your transcripts back.

So we’ve set our sights on building the speech company. Competition from companies like Google, Amazon, and Nuance is real, but none of them approaches speech recognition like we do. We’ve rebuilt the entire speech processing stack, replacing heuristics and stats-based speech processing with fully end-to-end deep learning (we use CNNs and RNNs). Using GPUs, we train speech models to learn each customer’s unique vocabulary, accents, product names, and acoustic environments. This can be the difference between correctly capturing “wasn’t delivered” and “was in the liver.” We’ve focused on speed because we think it’s critical for exploration and scale: our API returns transcripts of hour-long audio interactively, in seconds. It’s a tool many businesses wish they had.
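
To make the workflow concrete, here’s a minimal sketch of what posting a recording to the API and reading back structured results can look like. The endpoint URL, headers, and response fields below are illustrative assumptions for this post, not our documented interface; see https://www.deepgram.com/ for the real thing.

    # Illustrative sketch only: the endpoint, auth header, and JSON layout are
    # assumptions for this example, not a spec. Requires the `requests` package.
    import requests

    API_URL = "https://api.deepgram.com/v1/listen"   # assumed endpoint
    API_KEY = "YOUR_API_KEY"                         # placeholder credential

    with open("call_recording.wav", "rb") as audio:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Token {API_KEY}",
                     "Content-Type": "audio/wav"},
            data=audio,   # stream the raw audio bytes as the request body
        )

    resp.raise_for_status()
    result = resp.json()

    # Assumed response layout: one transcript alternative per audio channel.
    print(result["results"]["channels"][0]["alternatives"][0]["transcript"])

The reason for returning structured JSON rather than a flat text file is that it stays parsable: you can feed it straight into search, analytics, or compliance checks.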

So far we’ve released tools that:

  - transcribe speech with timestamps
  - support real-time streaming
  - have multi-channel support
  - understand multiple languages (in beta now)
  - allow you to deeply search for keywords and phrases
  - transcribe to phonemes
  - get more accurate with use

Some of those are better mousetraps for things you’re already familiar with, and some are completely new levers to pull on your audio data. We’ve built the core on English, but now we’re releasing the tools for all of the Americas. (Aside: you can transfer-learn speech models, and it works well!) A rough sketch of what the keyword search above looks like from the API side follows.
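
Here’s that sketch, using the compliance phrase from earlier as the query. Again, the search parameter and the shape of the hits in the response are assumptions made for illustration, not documentation.

    # Hypothetical keyword-search call: the `search` parameter and the hit
    # layout in the response are assumptions, not the documented API.
    import requests

    API_URL = "https://api.deepgram.com/v1/listen"   # assumed endpoint
    API_KEY = "YOUR_API_KEY"                         # placeholder credential

    with open("call_recording.wav", "rb") as audio:
        resp = requests.post(
            API_URL,
            params={"search": "this call may be recorded"},   # assumed parameter
            headers={"Authorization": f"Token {API_KEY}",
                     "Content-Type": "audio/wav"},
            data=audio,
        )

    resp.raise_for_status()
    for hit in resp.json()["results"]["channels"][0]["search"][0]["hits"]:
        # Each hit is assumed to carry a confidence score and a start time in
        # seconds, so you can jump straight to that moment in the call.
        print(hit["confidence"], hit["start"])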

Accuracy will continue to improve for transcription, but I think we can do more. It's such a large problem, and we really want to make a dent in “solving speech”. That means asking, truly: “What can a human do?”

People can, with little context, jump into a conversation and determine:

  - What are the words? When are they said? Who said what?
  - Is this person young/old? Male/Female? Exhausted/energetic?
  - Where is there confusion?
  - What language are they speaking? What’s the speaker’s accent?
  - What’s the topic of the conversation? Small talk or real? Is it going well?

Some of those things are being worked on now: additional language support, language and accent detection, sentiment analysis, auto-summarization, topic modeling, and more.

We’d love to hear your feedback and ideas.


