VoiceMod this week acquired the four-person startup Vocto Labs. Despite its small headcount, Vocto Labs has a big profile because it created Holly+, the first synthetic singer on Spotify. Vocto technology appears to be behind the singing voices in the new Text to Song product.
VoiceMod’s new Text to Song is marketed as “the first generative music online tool.” The user selects a song from a list of a few Christmas tunes and two other options. Next, the user selects from among seven singing voices and enters the lyrics for the song. Text to Song then generates the composition.
A key theme of our Synthedia 2 conference was the rise of production-quality use cases. From virtual newscasters to synthetic podcast hosts, what was a novelty two years ago is used in regular media production today.
A Creative Tool for Everyone
Those production-quality use cases highlighted during Synthedia 2 were mostly AI acting as a co-pilot for experts. VoiceMod’s Text to Song attempts to bring production-quality synthetic media within easy reach of non-experts. It is not a tool for professionals. It is for the masses.
Most AI-based music and audio generation solutions are trending toward this type of configuration-first design. You are not required to describe what you want in natural language or learn how to operate a panel of buttons and dials. Instead, you click a couple of selection buttons, and the music is generated. VoiceMod doesn’t offer any music customization but does offer singer and lyric options.
Granted, you can’t add many lyrics. The song clips available today range from 9 to 27 seconds, and there is lead-in music with no singing. So, plan on 10-30 words.
VoiceMod says in its announcement that it plans to add more song and singer choices in the future. Although the videos have very few views so far, VoiceMod has a Discord community of over 200,000 and a sound catalog exceeding 20,000 items, so I expect the company will look to make this solution scale. You can try it out here.
Is This Generative AI?
The text-to-x phenomenon has also revolutionized synthetic media. Also known as generative AI, text-to-x uses concepts expressed in natural language text prompts to generate novel outputs. In a few seconds, these AI models generate text, images, audio, and video that would take a human many minutes, hours, or days to match. This is the fastest-growing synthetic media segment by a wide margin.
So, you can imagine there is an incentive to classify a new product as text-to-x to ride that mindshare momentum. A fair question is whether this is generative AI. It is generative in that you type in lyrics that are then set to a song and performed by a synthetic voice. However, it is not generative AI that operates in the same way that large language models, text-to-image models, and music generators do, where outputs are novel, emergent, and unknown before generation.
The only novel element of the Text to Song output is the lyrics. Everything else is known. But that may be enough to create significant use and enthusiasm. Text to Song has the potential to be fun and funny, and it is easy enough for anyone to use.
In addition, with the introduction of yet another mass-market text-to-x solution, consumers are becoming more familiar with AI-based products. That growing familiarity will aid adoption across the synthetic media landscape.