We (Mark and Justin) started writing music together a few years ago but felt limited in our ability to create anything that we were proud of. Modern music production is highly technical and requires knowledge of sound design, tracking, arrangement, mixing, mastering, and digital signal processing. Even with our technical backgrounds (in AI and cloud computing respectively), we struggled to learn what we needed to know.
The emergence of latent diffusion models was a turning point for us, just as it was for many others in tech. All of a sudden it was possible to use AI to create beautiful art. After meeting our cofounder Diandre (half of the DJ duo Bandlez and an expert music producer), we formed a team to apply generative AI to music production.
We began by focusing on generating music samples rather than full songs. Samples gave us several advantages, the biggest being that we could build and train our custom models very quickly thanks to the short length of the generated audio (typically 2-10 seconds). Conveniently, our early text-to-sample model also fit well within many existing music producers’ workflows, which often involve heavy use of samples.
We ran into several challenges when creating our text-to-sound model. The first came from training our latent transformer (similar to OpenAI’s Sora) on top of off-the-shelf audio autoencoders (like Meta’s Encodec) and text embedders (like Google’s T5). The domain gap between the data used to train these off-the-shelf models and our sample data was much greater than we expected, which led us to misattribute blame among the three model components (latent transformer, autoencoder, and embedder) during development. To see how musicians can use our text-to-sound generator to write music, check out our demo below:
https://www.youtube.com/watch?v=MT3k4VV5yrs&ab_channel=Sound...
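For the technically curious, here’s a rough sketch of how the three components fit together, using the Hugging Face wrappers for Encodec and T5. The latent transformer itself (and every helper name below) is a hypothetical stand-in, not our production code:

  import torch
  from transformers import AutoProcessor, AutoTokenizer, EncodecModel, T5EncoderModel

  # 1) Text embedder: turns the prompt into a sequence of conditioning vectors.
  tokenizer = AutoTokenizer.from_pretrained("t5-base")
  text_encoder = T5EncoderModel.from_pretrained("t5-base")

  # 2) Audio autoencoder: compresses waveforms into discrete latent codes
  #    (and can decode codes back into audio).
  processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
  codec = EncodecModel.from_pretrained("facebook/encodec_24khz")

  def embed_prompt(prompt: str) -> torch.Tensor:
      ids = tokenizer(prompt, return_tensors="pt").input_ids
      return text_encoder(input_ids=ids).last_hidden_state  # (1, seq_len, dim)

  def encode_audio(waveform, sampling_rate: int = 24_000):
      inputs = processor(raw_audio=waveform, sampling_rate=sampling_rate, return_tensors="pt")
      out = codec.encode(inputs["input_values"], inputs["padding_mask"])
      return out.audio_codes  # discrete latent codes the transformer learns to predict

  # 3) Latent transformer (hypothetical): trained to predict audio codes
  #    conditioned on the text embedding; Encodec then decodes the predicted
  #    codes back into a waveform, e.g.
  # codes = sample_latent_transformer(embed_prompt("punchy 808 kick"), max_seconds=5)
  # audio = codec.decode(codes, audio_scales, padding_mask)

The domain-gap problem we mentioned shows up in exactly this setup: if the autoencoder or embedder was trained on speech or general audio rather than samples, a bad generation could be the fault of any of the three pieces.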
The second issue we experienced was more on the product design side. When we spoke with our users in depth, we learned that novice music producers had no idea what to type into the prompt box, while expert music producers felt that our model’s output wasn’t always what they had in mind when they typed in their prompt. It turns out that text is much better at specifying the contents of visual art than of music. This issue is what led us to our new product: the Infinite Sample Pack.
The Infinite Sample Pack does something rather unconventional: prompting with audio rather than text. Instead of requiring you to type out a prompt and specify many parameters, all you need to do is click a button to receive new samples. Each time you select a sound, our system embeds “prompt samples” as input to our model, which then creates infinite variations. By limiting the number of possible outputs, we’re able to hide inference latency by pre-computing lots of samples ahead of time. This new approach has seen much wider adoption, so this month we’ll be opening the system up so that everyone can create Infinite Sample Packs of their very own! To compare the two workflows, check out our new demo of the Infinite Sample Pack:
https://www.youtube.com/watch?v=BqYhGipZCDY&ab_channel=Sound...
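The pre-computation trick is simple enough to sketch: because each pack is seeded by a fixed set of prompt samples, variations can be generated offline and served instantly when a user clicks. A toy illustration with hypothetical names (not our actual service code):

  import random
  from collections import defaultdict

  class InfiniteSamplePack:
      def __init__(self, model, prompt_samples: dict, pool_size: int = 64):
          self.model = model                    # audio-to-audio generator (hypothetical)
          self.prompt_samples = prompt_samples  # e.g. {"kick": waveform, "snare": waveform}
          self.pool = defaultdict(list)         # pre-computed variations per sound
          self.pool_size = pool_size

      def precompute(self):
          """Fill the pool ahead of time so users never wait on inference."""
          for name, waveform in self.prompt_samples.items():
              while len(self.pool[name]) < self.pool_size:
                  # The model embeds the prompt sample and generates a variation of it.
                  self.pool[name].append(self.model.generate_variation(waveform))

      def next_sample(self, name: str):
          """What the button click calls: instant, since the audio already exists."""
          if not self.pool[name]:
              self.precompute()  # top the pool up lazily if it ever runs dry
          return self.pool[name].pop(random.randrange(len(self.pool[name])))

Limiting the prompt space to a curated set of sounds is what makes this caching possible; a free-text prompt box can’t be pre-computed the same way.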
Overall, our founding principle is to start by asking: “What do musicians actually want?” Meta’s open-sourcing of MusicGen has resulted in many interchangeable text-to-music products, but ours has been embraced by musicians. By keeping an open dialog with our users we’ve been able to meet many of their needs: specifying BPM and key, including one-shot instrument samples (so musicians can write their own melodies), and adding drag-and-drop support for digital audio workstations via our desktop app and VST. To hear some of the awesome songs made with our product, take a listen to our community showcases below!
https://soundcloud.com/soundry-ai/sets/community-showcases
We hope you enjoy our tool, and we look forward to the discussion in the comments!