Google WaveNet sound and speech synthesis
-
- KVRist
- Topic Starter
- 107 posts since 28 Aug, 2014
Hi! I'm not into DSP programming, but I stumbled across this and thought it might interest some of you:
https://deepmind.com/blog/wavenet-gener ... raw-audio/
It's a deep neural network developed for improving speech synthesis. The speech synthesis is extremely good, but even more fascinating is that the network can be trained on any input data. Scroll down and check out the piano playing examples. Unreal, though there might be a reason why the examples are so short...
I also love the random babble examples, although possible uses probably still have to be found.
Link to the paper:
https://drive.google.com/file/d/0B3cxcn ... JINDQ/view
-
- KVRian
- 688 posts since 17 Sep, 2007 from Planet Thanet
-
- KVRian
- 1379 posts since 26 Apr, 2004 from UK
Deep learning does indeed allow that. But then you can't know why something happens. Neural networks in all their glory.
-
- KVRian
- 1153 posts since 11 Aug, 2004 from Breuillet, France
I'd like to ask the $1,000,000 question: would it be of any use to train a deep neural network to make audio effects as well? Analog modeling stuff too?
-
- KVRian
- 1379 posts since 26 Apr, 2004 from UK
Definitely. It would do so and you wouldn't even know how it did it.
The main issue is of course that to do this, you need non-linear neurons, and these are achieved through the sigmoid function (i.e. an exponential). With several levels of depth plus a fairly large input width, that is likely to get expensive for real time.
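For the curious, the sigmoid mentioned above is just the logistic function, one exp() per neuron per sample. A minimal Python sketch (the function names are mine, purely for illustration):

```python
import math

def sigmoid(x):
    # Logistic function: one exp() call per evaluation. Stacked across
    # many layers and many samples per second, this is where the
    # real-time cost comes from.
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    # One non-linear neuron: weighted sum of inputs, then the sigmoid.
    return sigmoid(sum(i * w for i, w in zip(inputs, weights)) + bias)

print(sigmoid(0.0))                       # 0.5
print(neuron([0.2, -0.4], [1.0, 0.5], 0.1))
```

At audio rates (e.g. 44100 samples/s), even a modest network multiplies that exp() count into the millions per second, which is the real-time concern raised above.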
- KVRist
- 310 posts since 21 Oct, 2008 from new england
Ivan_C wrote:I'd like to ask the 1000000$ question : would it have any use to train a deep neural network to make audio effects as well ? Analog modeling stuff too ?
An interesting concept would be to make a plugin that learns what an individual user prefers, slightly modifying its algorithms to match the user's preference. That would be incredible.
Hence presets could almost be various user interest profiles.
-
- KVRAF
- 4321 posts since 26 Jun, 2004
-
- KVRian
- 1153 posts since 11 Aug, 2004 from Breuillet, France
I really need to go deep into this topic one day.
And I'm even waiting for the day when Google WaveNet is used to do mastering on songs.
I just remembered I have heard some "online automatic mastering applications" use AI already... But they are not that good yet...
- KVRist
- 296 posts since 1 Apr, 2009 from Hannover, Germany
Ivan_C wrote: I just remembered I have heard some "online automatic mastering applications" use AI already... But they are not that good yet...
Well, copywriters can call anything AI, even just a simple set of static rules.
In the case of automatic mastering, this will be a neural network tweaking some knobs so that certain *measurements* come close to a reference.
I think the challenge with this is not the artificial intelligence itself (setting up and training a network isn't too hard), but equipping it with the right "senses", i.e. deciding which measurements to extract from the audio so that they carry some meaning.
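To make the "measurements vs. reference" idea concrete, here's a toy Python sketch. The two measurements (RMS level and crest factor) are my own illustrative picks, not from any real mastering product:

```python
import math

def measurements(samples):
    # Extract simple "senses" from an audio block:
    # RMS level and crest factor (peak / RMS).
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return {"rms": rms, "crest": peak / rms if rms > 0 else 0.0}

def distance(a, b):
    # How far the candidate's measurements are from the reference's.
    # An optimizer (or a network) would tweak EQ/compressor knobs
    # to drive this number down.
    return sum((a[k] - b[k]) ** 2 for k in a)

reference = measurements([0.5, -0.5, 0.5, -0.5])     # toy "reference master"
candidate = measurements([0.25, -0.25, 0.25, -0.25])  # quieter mix
print(distance(candidate, reference))                 # 0.0625
```

Whether those two numbers (or any set of numbers) actually capture what a good master *sounds* like is exactly the "right senses" problem.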
-
- KVRer
- 11 posts since 18 May, 2016
The speech modeling is fascinating, obviously, but just think how this conditioning could be applied to the timbre/performance of individual instruments: using the right dataset, it could be made to model, say, classic analog leads, controlled by user preferences to not only offer modifiable sound parameters but adapt performance qualities distinct to that instrument/style (or that of another). A synthesizer that performs like an operatic soprano, a cello that plays like a maraca, or a drum kit that sounds like speech.
-
Guillaume Piolat
- KVRist
- 279 posts since 21 Sep, 2015 from Grenoble
The teams using Deep Learning for speech recognition do not yet use unprocessed audio samples, but a smaller set of descriptors based on an FFT.
While their goal is to remove this layer, it hasn't happened yet for performance and trainability reasons.
Checkout our VST3/VST2/AU/AAX/LV2:
Inner Pitch | Lens | Couture | Panagement | Graillon
-
- KVRAF
- 2475 posts since 15 Apr, 2004 from Capital City, UK
My incredibly nerdy friend and I have used RNNs to analyse audio files to 'teach' a neural net how to construct audio data. We even had a super-fast GPU to farm the calculations out to, and it STILL took 48 hours to generate a very strange noise sample, 5 seconds long. We used one engine to analyse a file at the sample level, then we found another engine which used an FFT-aware brain, the output of which was a bit more 'normal'... still very simple sine-wave constructions.
I will see if he has some time this weekend to have a look at this. WOW, those samples nearer the end are brilliantly confusing, maybe we can create some in our lab. It sounds very much like trying to listen to someone talk while having a quite intense LSD rush.
I think I'm going to feed our engine some Kraftwerk.
-
- KVRAF
- 2256 posts since 29 May, 2012
Guillaume Piolat wrote:The teams using Deep Learning for speech recognition do not use yet unprocessed audio samples, but a smaller set of descriptors based on a FFT.
I do not know about deep-learning-based speech recognition, but more classical approaches use feature descriptors based on the mel frequency cepstrum. That's something like the FFT of the log of the FFT of the speech, aligned to human ear sensitivities (not exactly, as the second step is a DCT). More info can be found here: https://en.wikipedia.org/wiki/Mel-frequency_cepstrum https://en.wikipedia.org/wiki/Cepstrum
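As a rough illustration of that spectrum -> log -> DCT chain, here's a plain-Python sketch. It deliberately leaves out the mel filterbank and windowing, so it's a plain cepstrum-style feature rather than a true MFCC:

```python
import math

def dft_mag(frame):
    # Magnitude spectrum via a naive DFT (fine for a short demo frame).
    N = len(frame)
    return [abs(sum(frame[n] * complex(math.cos(2 * math.pi * k * n / N),
                                       -math.sin(2 * math.pi * k * n / N))
                    for n in range(N)))
            for k in range(N // 2)]

def dct_ii(x):
    # DCT-II: the "second transform" step of the cepstrum mentioned above.
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (n + 0.5) / N)
                for n in range(N))
            for k in range(N)]

def cepstrum_like(frame, n_coeffs=4):
    # spectrum -> log -> DCT; keep only the first few coefficients,
    # which describe the coarse spectral envelope.
    log_spec = [math.log(m + 1e-10) for m in dft_mag(frame)]
    return dct_ii(log_spec)[:n_coeffs]

frame = [math.sin(2 * math.pi * 3 * n / 32) for n in range(32)]
print(cepstrum_like(frame))
```

A real MFCC pipeline would insert a bank of mel-spaced triangular filters between the magnitude spectrum and the log, which is what gives the alignment to human ear sensitivities.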
~stratum~