Play.ht (YC W23) – Generate and clone voices from 20 seconds of audio

Hey HN, we are Mahmoud and Hammad, co-founders of Play.ht, a text-to-speech synthesis platform. We're building Large Language Speech Models across all languages with a focus on voice expressiveness and control.

Today, we are excited to share beta access to our latest model, Parrot, which can clone any voice from a few seconds of audio and generate expressive speech from text.

You can try it out here: https://playground.play.ht. And there are demo videos at https://www.youtube.com/watch?v=aL_hmxTLHiM and https://www.youtube.com/watch?v=fdEEoODd6Kk.

The model also captures accents well and is able to speak in all English accents. Even more interesting, it can make non-English speakers speak English while preserving their original accent. Just upload a clip of a non-English speaker and try it yourself.

Existing text-to-speech models lack expressiveness, control, or directability of the voice: for example, making a voice speak in a specific way, or emphasizing a certain word or part of the speech. Our goal is to solve these across all languages. Since the voices are built on LLMs, they are able to express emotions based on the context of the text.

Our previous speech model, Peregrine, which we released last September, is able to laugh, scream and express other emotions: https://play.ht/blog/introducing-truly-realistic-text-to-spe.... We posted it to HN here: https://news.ycombinator.com/item?id=32945504.

With Parrot, we've taken a slightly different approach and trained it on a much larger data set. Both Parrot and Peregrine only speak English at the moment, but we are working on other languages and are seeing impressive early results that we plan to share soon.

Content creators of all kinds (gaming, media production, e-learning) spend a lot of time and effort recording and editing high-quality audio. We solve that and make it as simple as writing and editing text. Our users range from individual creators looking to voice their videos, podcasts, etc., to teams at various companies creating dynamic audio content.

We initially built this product for ourselves, to listen to books and articles online, and found that the quality of existing TTS was very low. We kept working on it until, eventually, we trained our own models and built a business around it. There are many robotic TTS services out there, but ours allows people to generate truly human-level expressive speech and allows anyone to clone voices instantly with strong resemblance. We initially used existing TTS models and APIs, but when we started talking to our customers in gaming, media production, and other fields, we found that people didn't like the monotone, robotic TTS style. So we doubled down on training a new model based on emerging architectures that use transformers and self-supervised learning.

On our platform, we offer two types of voice cloning: high-fidelity and zero-shot. High-fidelity voice cloning requires around 20 minutes of audio data and creates an expressive voice that is more robust and captures the accent of the target voice with all its nuances. Zero-shot clones the voice with only a few seconds of audio and captures most of the accent and tone, but isn’t as nuanced because it has less data to work with. We also offer a diverse library of over a hundred voices for various use cases.

We offer two ways to use these models on the platform: (1) our text-to-voice editor, which allows users to create and manage their audio files in projects, etc.; and (2) our API: https://docs.play.ht/reference/api-getting-started. The API supports streaming and polling, and we are working on reducing the latency to make it real-time. We have a free plan and transparent pricing for anyone who wants to upgrade.
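
For readers unfamiliar with the polling pattern mentioned above, here is a minimal, generic sketch in Python. It is not Play.ht's actual API: the `fetch_status` callable, the `state` and `audio_url` fields, and the example URL are all hypothetical stand-ins, and a real client would replace the lambda with an HTTP request as described in the API docs.

```python
import time
from typing import Callable, Dict, Optional


def poll_until_done(
    fetch_status: Callable[[], Dict[str, str]],
    interval_s: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll a job until it finishes; return the audio URL, or None on timeout.

    The response shape ("state", "audio_url", "error") is an assumption made
    for this sketch, not a documented format.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if status.get("state") == "done":
            return status.get("audio_url")
        if status.get("state") == "failed":
            raise RuntimeError(status.get("error", "generation failed"))
        sleep(interval_s)  # wait before asking again
    return None


# Stand-in for a real HTTP call: reports "pending" twice, then "done".
_responses = iter([
    {"state": "pending"},
    {"state": "pending"},
    {"state": "done", "audio_url": "https://example.com/audio.mp3"},
])

result = poll_until_done(lambda: next(_responses), interval_s=0.0)
print(result)
```

Polling is simple to implement but adds up to one interval of delay after the audio is ready, which is one reason a streaming mode is the better fit when latency matters.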

We are thrilled to be sharing our new model, and look forward to feedback!
