Play.ht (YC W23) – Generate and clone voices from 20 seconds of audio

Hey HN, we are Mahmoud and Hammad, co-founders of Play.ht, a text-to-speech synthesis platform. We're building Large Language Speech Models across all languages with a focus on voice expressiveness and control.

Today, we are excited to share beta access to our latest model, Parrot, that is capable of cloning any voice with a few seconds of audio and generating expressive speech from text.

You can try it out here: https://playground.play.ht. And there are demo videos at https://www.youtube.com/watch?v=aL_hmxTLHiM and https://www.youtube.com/watch?v=fdEEoODd6Kk.

The model also captures accents well and is able to speak in all English accents. Even more interesting, it can make non-English speakers speak English while preserving their original accent. Just upload a non-English speaker clip and try it yourself.

Existing text to speech models either lack expressiveness, control or directability of the voice. For example, making a voice speak in a specific way, or emphasizing on a certain word or parts of the speech. Our goal is to solve these across all languages. Since the voices are built on LLMs they are able to express emotions based on the context of the text.

Our previous speech model, Peregrine, which we released last September, is able to laugh, scream and express other emotions: https://play.ht/blog/introducing-truly-realistic-text-to-spe.... We posted it to HN here: https://news.ycombinator.com/item?id=32945504.

With Parrot, we've taken a slightly different approach and trained it on a much larger data set. Both Parrot and Peregrine only speak English at the moment but we are working on other languages and are seeing impressive early results that we plan to share soon.

Content creators of all kinds (gaming, media production, elearning) spend a lot of time and effort recording and editing high-quality audio. We solve that and make it as simple as writing and editing text. Our users range from individual creators looking to voice their videos, podcasts, etc to teams at various companies creating dynamic audio content.

We initially built this product for ourselves to listen to books and articles online and then found the quality of TTS is very low, so we started working on this product until, eventually we trained our own models and built a business around it. There are many robotic TTS services out there, but ours allows people to generate truly human-level expressive speech and allows anyone to clone voices instantly with strong resemblance. We initially used existing TTS models and APIs but when we started talking to our customers in gaming, media production, and others, people didn't like the monotone robotic TTS style. So we doubled down in training a new model based on the new emerging architectures using transformers and self supervised learning.

On our platform, we offer two types of voice cloning: high-fidelity and zero-shot. High-fidelity voice cloning requires around 20 minutes of audio data and creates an expressive voice that is more robust and captures the accent of the target voice with all its nuances. Zero-shot clones the voice with only a few seconds of audio and captures most of the accent and tone, but isn’t as nuanced because it has less data to work with. We also offer a diverse library of over a hundred voices for various use cases.

We offer two ways to use these models on the platform: (1) our text to voice editor, that allows users to create and manage their audio files in projects, etc.; and (2) our API - https://docs.play.ht/reference/api-getting-started. The API supports streaming and polling and we are working on reducing the latency to make it real time. We have a free plan and transparent pricing available for anyone to upgrade.

We are thrilled to be sharing our new model, and look forward to feedback!



Get Top 5 Posts of the Week



best of all time best of today best of yesterday best of this week best of this month best of last month best of this year best of 2024 best of 2023 yc w25 yc s24 yc w24 yc s23 yc w23 yc s22 yc w22 yc s21 yc w21 yc s20 yc w20 yc s19 yc w19 yc s18 yc w18 yc all-time 3d algorithms animation android [ai] artificial-intelligence api augmented-reality big data bitcoin blockchain book bootstrap bot css c chart chess chrome extension cli command line compiler crypto covid-19 cryptography data deep learning elexir ether excel framework game git go html ios iphone java js javascript jobs kubernetes learn linux lisp mac machine-learning most successful neural net nft node optimisation parser performance privacy python raspberry pi react retro review my ruby rust saas scraper security sql tensor flow terminal travel virtual reality visualisation vue windows web3 young talents


andrey azimov by Andrey Azimov