Deepgram (YC W16) – Scalable Speech API for Businesses

Hey HN,

I’m Scott Stephenson, one of the cofounders of Deepgram (https://www.deepgram.com/). Getting information out of recorded phone calls and meetings is time-intensive, costly, and imprecise. Our speech recognition API lets businesses reliably turn high-value unstructured audio into accurate, parsable data.

Deepgram started when my cofounder Noah Shutty and I had just finished looking for dark matter (in a particle physics lab at the University of Michigan). Noah had the idea to start recording all the audio from his life, 24/7. After gathering hundreds of hours of recordings, we wanted to search inside this fresh dataset, but realized there wasn’t a good way to find specific moments. So we built a tool using the same AI techniques we had used to find dark matter particle events, and it ended up working pretty well. A few months later, we made a single-page demo to show off “searching through sound” and posted it to HN. Pretty soon we were in the Winter 2016 batch of YC (https://techcrunch.com/2016/09/27/launching-a-google-for-sou...).

I’d say we didn’t know what we were getting ourselves into. Speech is a really big problem with a huge market, but it’s also a tough nut to crack. For decades, companies have been unable to get real learnings from their massive amounts of recorded audio (some companies record more than 1,000,000 minutes of call center calls every single day). They record that audio for a few reasons — some for compliance, some for training, and some for market research. The questions they’re trying to answer are usually as simple as:

  - “What is the topic of the call?” 
  - “Is this call compliant?” (did I say my company name, my name, and “this call may be recorded”?)
  - “Are people getting their problems solved quickly?” 
  - “Do my agents need training?” 
  - “What are our customers talking about? Competitors? Our latest marketing campaign?”

It’s the most intimate view you can get of your customers, but the problem is so large and difficult that companies have pushed it into the corner for the past couple of decades, doing just enough to stop the bleeding. Current tools only transcribe with around 50-60% accuracy on real-world, noisy, accented, industry-specific audio (don’t believe the ‘human level accuracy’ hype). When companies start solving problems with speech data, they first want transcription that’s accurate. After accuracy comes scale — another big problem. Speech processing is computationally expensive and slow. Imagine trying to get into an iterative problem-solving loop when you have to wait 24 hours to get your transcripts back.

So we’ve set our sights on building the speech company. Competition from companies like Google, Amazon, and Nuance is real, but none of them approaches speech recognition like we do. We’ve rebuilt the entire speech processing stack, replacing heuristics and stats-based speech processing with fully end-to-end deep learning (we use CNNs and RNNs). Using GPUs, we train speech models to learn each customer’s unique vocabulary, accents, product names, and acoustic environments. This can be the difference between correctly capturing “wasn’t delivered” and “was in the liver.” We’ve focused on speed because we think it’s critical for exploration and scale: our API returns transcripts of hour-long audio interactively, in seconds. It’s a tool many businesses wish they had.
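
To make the workflow concrete, here’s a minimal sketch of what posting a recording to the API and reading back structured results can look like. The endpoint URL, headers, and response fields below are illustrative assumptions for this post, not our documented interface; see https://www.deepgram.com/ for the real thing.

    # Illustrative sketch only: the endpoint, auth header, and JSON layout are
    # assumptions for this example, not a spec. Requires the `requests` package.
    import requests

    API_URL = "https://api.deepgram.com/v1/listen"   # assumed endpoint
    API_KEY = "YOUR_API_KEY"                         # placeholder credential

    with open("call_recording.wav", "rb") as audio:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Token {API_KEY}",
                     "Content-Type": "audio/wav"},
            data=audio,   # stream the raw audio bytes as the request body
        )

    resp.raise_for_status()
    result = resp.json()

    # Assumed response layout: one transcript alternative per audio channel.
    print(result["results"]["channels"][0]["alternatives"][0]["transcript"])

The reason for returning structured JSON rather than a flat text file is that it stays parsable: you can feed it straight into search, analytics, or compliance checks.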

So far we’ve released tools that:

  - transcribe speech with timestamps
  - support real-time streaming
  - have multi-channel support
  - understand multiple languages (in beta now)
  - allow you to deeply search for keywords and phrases
  - transcribe to phonemes
  - get more accurate with use

Some of those are better mousetraps for things you’re already familiar with, and some are completely new levers to pull on your audio data. We’ve built the core on English, but now we’re releasing the tools for all of the Americas. (Aside: you can transfer-learn speech models, and it works well!) A rough sketch of what the keyword search above looks like from the API side follows.
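
Here’s that sketch, using the compliance phrase from earlier as the query. Again, the search parameter and the shape of the hits in the response are assumptions made for illustration, not documentation.

    # Hypothetical keyword-search call: the `search` parameter and the hit
    # layout in the response are assumptions, not the documented API.
    import requests

    API_URL = "https://api.deepgram.com/v1/listen"   # assumed endpoint
    API_KEY = "YOUR_API_KEY"                         # placeholder credential

    with open("call_recording.wav", "rb") as audio:
        resp = requests.post(
            API_URL,
            params={"search": "this call may be recorded"},   # assumed parameter
            headers={"Authorization": f"Token {API_KEY}",
                     "Content-Type": "audio/wav"},
            data=audio,
        )

    resp.raise_for_status()
    for hit in resp.json()["results"]["channels"][0]["search"][0]["hits"]:
        # Each hit is assumed to carry a confidence score and a start time in
        # seconds, so you can jump straight to that moment in the call.
        print(hit["confidence"], hit["start"])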

Accuracy will continue to improve for transcription, but I think we can do more. It's such a large problem, and we really want to make a dent in “solving speech”. That means asking, truly: “What can a human do?”

People can, with little context, jump into a conversation and determine:

  - What are the words? When are they said? Who said what?
  - Is this person young/old? Male/Female? Exhausted/energetic?
  - Where is there confusion?
  - What language are they speaking? What’s the speaker’s accent?
  - What’s the topic of the conversation? Small talk or real? Is it going well?

Some of those things are being worked on now: additional language support, language and accent detection, sentiment analysis, auto-summarization, topic modeling, and more.

We’d love to hear your feedback and ideas.


