KVR Audio

samesung · Post by **samesung** » Fri Oct 06, 2023 10:56 am

I'm totally new at this so I'm looking for the best way to get started. So far I've only encountered retrieval-based voice conversion (RVC) so I'm assuming that is what I'm talking about, but I'm also interested if there are other models.

I need information on:
1. What software would you recommend for a) changing my own voice to someone else's; b) text-to-voice recording? What are the pros and cons of the available options? Is there a program that can be downloaded and run locally instead of cloud/online-based stuff? Paid vs free options? Is it possible to do the voice conversion in real time or near real time?
2. AI training guides - I saw a couple of videos, but I guess it depends on what program I'll be using.
3. Hardware configurations - I have a solid PC for audio production, and I'm about to purchase a GPU. I saw some of these processes use GPU for audio, so I'm wondering what is considered good enough - I'm looking at RTX 4060 (Ti) with 16GB VRAM, is that ok for local AI training?
4. Where do you keep up to date with the latest news and developments from that field?

Cheers!

whyterabbyt · Post by **whyterabbyt** » Fri Oct 06, 2023 10:57 am

You'd be better off at a specialist AI site. https://huggingface.co/

BertKoor · Post by **BertKoor** » Fri Oct 06, 2023 2:04 pm

q1 & 2: just google "voice deep fake"

jules99 · Post by **jules99** » Fri Oct 06, 2023 2:13 pm

To answer all of your questions in one: there isn't a VST doing this locally yet, so you don't need any software or a powerful computer just yet. As of right now, you HAVE to do it online.

When this recent video from Benn Jordan on voice-swap, one of the sites doing this, premiered, he hinted in the chat that he was already testing a VST version of this: https://www.youtube.com/watch?v=Xy3xmpmGgaA

But it remains to be seen if this version is offline, or if it will need to upload your vocals to a server and change them there, and how much that is going to cost.

There used to be a huge Discord server (AI Hub, I think) where you would just upload your acapella and then choose any celebrity singer to swap it with (as described here: https://www.youtube.com/watch?v=-lqg-xc6BT0). But that server got shut down recently through a DMCA request. A new, much smaller Discord server has been set up, but it's not nearly as big and versatile in its singer choices.

And if you want to use YOUR voice to perfectly sing, say, "All I want for Christmas", take a look at Controlla.XYZ. Again a website, unfortunately.

samesung · Post by **samesung** » Sat Oct 07, 2023 9:33 am

Thank you for the replies. I found the huggingface website a bit too confusing to use at first; it's really huge and covers just about every AI learning topic, from language models to images and audio. It looks like a great repository of models though.

I'll keep an eye out for Benn Jordan's VST. I really prefer to use this stuff locally. Also, it feels like all the online websites charge for something I should be able to use myself for free.

Junolab · Post by **Junolab** » Sat Oct 07, 2023 12:09 pm

samesung wrote: ↑Sat Oct 07, 2023 9:33 am Thank you for the replies. I found the huggingface website a bit too confusing to use at first; it's really huge and covers just about every AI learning topic, from language models to images and audio. It looks like a great repository of models though.

I'll keep an eye out for Benn Jordan's VST. I really prefer to use this stuff locally. Also, it feels like all the online websites charge for something I should be able to use myself for free.

Why should you be able to use it for free? IMO all options currently still suck. Use it if you find AI fun but not if you actually want to create music.

whyterabbyt · Post by **whyterabbyt** » Sat Oct 07, 2023 12:43 pm

samesung wrote: ↑Sat Oct 07, 2023 9:33 am Also, it feels like all the online websites charge for something I should be able to use myself for free.

yup you should; all you need to do is fire up your own cluster of a few hundred machines on AWS, and train a model based on your many hundred terabytes of data. should be zilch costs there.

oh wait. did you actually mean 'it feels like all the online websites charge for something that cost them a small fortune to develop, but I should be able to use myself for free'

samesung · Post by **samesung** » Sat Oct 07, 2023 1:42 pm

Exactly, I want to do it for fun, not for any serious production.

I see people use stuff like stable diffusion locally to generate art on their own PCs, so I assumed something like that exists for audio. I don't mean to belittle the work people put into developing all that; you misunderstood me. What I mean is that many people are involved in creating and training different programs and models and put them online for free, like on huggingface. Given that such infrastructure already exists, is partly available for free, and is also applied in other fields of AI generation, I figured I should probably be able to do it myself somehow. If someone wants to charge for their software solution, they are free to do so, of course. But I'm also free to explore what options are available.

whyterabbyt · Post by **whyterabbyt** » Sat Oct 07, 2023 2:10 pm

it's not for nothing that stable diffusion is the only 'free' example that most people can name. and i imagine that it wouldn't be that way except for the fact that the company behind it, Stability AI, actually does that 'online websites that charge' thing for its other ai services, like stable audio.

vurt · Post by **vurt** » Sat Oct 07, 2023 6:22 pm

whyterabbyt wrote: ↑Sat Oct 07, 2023 2:10 pm it's not for nothing that stable diffusion is the only 'free' example that most people can name. and i imagine that it wouldn't be that way except for the fact that the company behind it, Stability AI, actually does that 'online websites that charge' thing for its other ai services, like stable audio.

im tempted to set up a website offering voice replacement for vocal tracks, just name the artist then my "ai" will do the work!
then record the track with me doing an impression of the artist, in a vic reeves club stylee.

Vortifex · Post by **Vortifex** » Sat Oct 07, 2023 8:30 pm

Emvoice is pretty good for text-to-vocals: https://emvoiceapp.com/

Voice.ai is good for realtime manipulation of voice: https://voice.ai/

rACatkvr · Post by **rACatkvr** » Sun Oct 08, 2023 3:41 am

You might also have a look at Synthesizer V at https://dreamtonics.com/synthesizerv/.

kidslow · Post by **kidslow** » Sun Oct 08, 2023 4:32 am

rACatkvr wrote: ↑Sun Oct 08, 2023 3:41 am You might also have a look at Synthesizer V at https://dreamtonics.com/synthesizerv/.

Looks like Yamaha's Vocaloid. Those "singing synths" are built on the model of a "voice pack" with a unique voice that is recorded and broken down into phonemes. You then manipulate the audio samples using an interface similar to Melodyne. They still sound like singing robots, but the quality improves with each iteration. There is a whole rabbit-hole niche of videos on YouTube.
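
For anyone curious what "recorded and broken down into phonemes" means in practice, here is a rough toy sketch of the general idea (concatenating stored phoneme clips with a short crossfade). It is not how Vocaloid or any real product is implemented. The sine tones stand in for recorded samples, and `VOICE_PACK`, `fake_phoneme` and `concatenate` are made-up names for illustration only.

```python
import math

SAMPLE_RATE = 8000  # Hz, kept low so the toy example stays small


def fake_phoneme(freq_hz, duration_s=0.1):
    """Stand-in for a recorded phoneme: a plain sine tone (hypothetical data)."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical "voice pack": phoneme label -> list of audio samples
VOICE_PACK = {
    "h": fake_phoneme(200),
    "e": fake_phoneme(300),
    "l": fake_phoneme(250),
    "o": fake_phoneme(350),
}


def concatenate(phonemes, voice_pack, fade=80):
    """Join phoneme clips end to end, overlapping `fade` samples with a linear crossfade."""
    out = []
    for label in phonemes:
        clip = voice_pack[label]
        if not out:
            out = list(clip)
            continue
        k = min(fade, len(out), len(clip))
        start = len(out) - k
        # Blend the tail of what we have so far with the head of the next clip
        for i in range(k):
            w = i / k
            out[start + i] = out[start + i] * (1 - w) + clip[i] * w
        out.extend(clip[k:])
    return out


audio = concatenate(["h", "e", "l", "o"], VOICE_PACK)
print(len(audio), "samples")
```

A real singing synth also has to stretch, pitch-shift and smooth those clips to follow a melody, which is where most of the "robot" sound comes from.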

OTOH Spotify is training machine learning to translate the voices of podcast hosts automatically into a select few other languages, with a same sounding intonation as the original. Heavy computing against a single known voice, doing a discrete translation task, not to sing but just spoken word. That is the state of the art today.

None of this is the same as modeling the singing voice of some known entity, karaoke'd on top of your own voice as OP seems to be seeking, but that convergence will continue because humans. However don't be fooled by how close we are to this possible future.

We could be 98% of the way there, with the final 2% taking another 20, 50, or 100 years. Perhaps we'll blow ourselves back to the dark ages before we get there?

kidslow · Post by **kidslow** » Sun Oct 08, 2023 5:13 am

jules99 wrote: ↑Fri Oct 06, 2023 2:13 pm To answer all of your questions in one: there isn't a VST doing this locally yet, so you don't need any software or a powerful computer just yet. As of right now, you HAVE to do it online.

When this recent video from Benn Jordan on voice-swap, one of the sites doing this, premiered, he hinted in the chat that he was already testing a VST version of this: https://www.youtube.com/watch?v=Xy3xmpmGgaA

But it remains to be seen if this version is offline, or if it will need to upload your vocals to a server and change them there, and how much that is going to cost.

There used to be a huge Discord server (AI Hub, I think) where you would just upload your acapella and then choose any celebrity singer to swap it with (as described here: https://www.youtube.com/watch?v=-lqg-xc6BT0). But that server got shut down recently through a DMCA request. A new, much smaller Discord server has been set up, but it's not nearly as big and versatile in its singer choices.

And if you want to use YOUR voice to perfectly sing, say, "All I want for Christmas", take a look at Controlla.XYZ. Again a website, unfortunately.

Something dodgy about both those videos. The first one feels like I'm being sold snake oil and the second has a comment section that reads like an astroturf campaign. Very oversold and underwhelming. The comments on that second video read like Fiverr meets badly written Amazon reviews. That's where all the AI work went, into generating them. lol

Ou_Tis · Post by **Ou_Tis** » Sun Oct 08, 2023 11:46 am

kidslow wrote: ↑Sun Oct 08, 2023 4:32 am
rACatkvr wrote: ↑Sun Oct 08, 2023 3:41 am You might also have a look at Synthesizer V at https://dreamtonics.com/synthesizerv/.
Looks like Yamaha's Vocaloid. Those "singing synths" are built on the model of a "voice pack" with a unique voice that is recorded and broken down into phonemes. You then manipulate the audio samples using an interface similar to Melodyne. They still sound like singing robots, but the quality improves with each iteration. There is a whole rabbit-hole niche of videos on YouTube.

OTOH Spotify is training machine learning to translate the voices of podcast hosts automatically into a select few other languages, with a same sounding intonation as the original. Heavy computing against a single known voice, doing a discrete translation task, not to sing but just spoken word. That is the state of the art today.

None of this is the same as modeling the singing voice of some known entity, karaoke'd on top of your own voice as OP seems to be seeking, but that convergence will continue because humans. However don't be fooled by how close we are to this possible future. We could be 98% of the way there, with the final 2% taking another 20, 50, or 100 years. Perhaps we'll blow ourselves back to the dark ages before we get there?

No, Synthesizer V is based on neural networks and can sound extremely realistic.

https://www.youtube.com/watch?v=MTiDN08F10w&t=11s

https://www.youtube.com/watch?v=oT-dMeapZSY

https://www.youtube.com/watch?v=22XQKyh2xBc

https://www.youtube.com/watch?v=1bUv3aPMC20

Very soon they're going to add a feature to convert audio (including onset timing, pitch curve, and lyrics) to their native format. There's going to be a full demo of these new features sometime during Music China (October 11-14).

Voice AI - total beginner