At this point, anyone who has been following AI research is long familiar with generative models that can synthesize speech or melodic music from nothing but text prompting. Nvidia’s newly revealed “Fugatto” model looks to go a step further, using new synthetic training methods and inference-level combination techniques to “transform any mix of music, voices, and sounds,” including the synthesis of sounds that have never existed.
While Fugatto isn’t available for public testing yet, a sample-filled website showcases how Fugatto can be used to dial a number of distinct audio traits and descriptions up or down, resulting in everything from the sound of saxophones barking to people speaking underwater to ambulance sirens singing in a kind of choir. While the results can be a bit hit or miss, the vast array of capabilities on display helps support Nvidia’s description of Fugatto as “a Swiss Army knife for sound.”
You’re only as good as your data
In a research paper, over a dozen Nvidia researchers explain the difficulty of crafting a training dataset that can “reveal meaningful relationships between audio and language.” While standard language models can often infer how to handle various instructions from the text-based data itself, it can be hard to generalize descriptions and traits from audio without more explicit guidance.