Google WaveNet sound and speech synthesis

DSP, Plugin and Host development discussion.

Post

stratum wrote:
The teams using Deep Learning for speech recognition do not yet use unprocessed audio samples, but a smaller set of descriptors based on an FFT.
I do not know about deep-learning-based speech recognition, but more classical approaches use feature descriptors based on the mel-frequency cepstrum. That's something like the FFT of the log of the FFT of the speech signal, aligned to human ear sensitivities (not exactly, as the second step is a DCT). More info can be found here: https://en.wikipedia.org/wiki/Mel-frequency_cepstrum https://en.wikipedia.org/wiki/Cepstrum
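To make the "FFT of log of FFT" description concrete, here is a toy pure-Python sketch of a cepstrum computation. It uses a naive DFT and an explicit DCT-II for readability; a real MFCC front end would use a fast FFT and insert a mel filterbank before the log step, which this sketch omits.

```python
import math

def power_spectrum(frame):
    """Naive DFT magnitude-squared (illustration only; real code uses an FFT)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append(re * re + im * im)
    return spec

def dct_ii(xs):
    """DCT-II: the 'second FFT-like step' that turns the log-spectrum into a cepstrum."""
    n = len(xs)
    return [sum(x * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t, x in enumerate(xs))
            for k in range(n)]

def cepstral_features(frame, n_coeffs=13):
    """Log of the power spectrum, then a DCT: a bare-bones cepstrum.
    (A real MFCC pipeline applies a mel filterbank before the log.)"""
    log_spec = [math.log(p + 1e-10) for p in power_spectrum(frame)]
    return dct_ii(log_spec)[:n_coeffs]

# One 32-sample sine frame yields 13 cepstral coefficients.
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
coeffs = cepstral_features(frame)
```

Keeping only the first dozen or so DCT coefficients is what compresses a whole spectrum into the small descriptor vector fed to the recognizer.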
Yes that's the inputs they use for their neural networks.

Post

There are a few ways to tag speech. There also needs to be some context-based learning for natural pauses and pitch changes; something like Markov chains may be used for that. All in all, it's a massive undertaking in reading analysis.
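As a toy illustration of the Markov-chain idea, here is a sketch over an invented prosody tag set. Both the tags and the transition probabilities are made up for illustration; a real system would estimate them from labelled speech.

```python
# Hypothetical first-order Markov chain over prosody tags.
# Rows are the current tag; values are P(next tag | current tag).
transitions = {
    "word":       {"word": 0.7,  "pause": 0.2,  "pitch_rise": 0.1},
    "pause":      {"word": 0.9,  "pause": 0.05, "pitch_rise": 0.05},
    "pitch_rise": {"word": 0.8,  "pause": 0.2,  "pitch_rise": 0.0},
}

def sequence_probability(tags):
    """Probability of a whole tag sequence under the chain (uniform start)."""
    p = 1.0 / len(transitions)
    for prev, cur in zip(tags, tags[1:]):
        p *= transitions[prev][cur]
    return p
```

Scoring candidate tag sequences this way lets a synthesizer prefer natural pause and pitch placements over unlikely ones.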
Incidentally, Microsoft recently released its deep learning suite as open source:
http://blogs.microsoft.com/next/2016/01 ... 11069z01c8

Dave H.

Post

Incidentally, Microsoft recently released its deep learning suite as open source:
http://blogs.microsoft.com/next/2016/01 ... 11069z01c8
It's interesting that many are following the same approach once it is proven to work.
Do you know an introductory paper that describes how these neural nets are integrated into a decoder? It's not difficult to see how an HMM-based Viterbi search would work, but there are many ways a neural network could be used, and it's not obvious why one would be preferred over another.
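For contrast, the HMM/Viterbi side is easy to write down. A minimal sketch, with toy states, transitions, and per-frame emission probabilities (all invented, not a real acoustic model):

```python
def viterbi(frames, trans, start):
    """frames: per-frame dict state -> P(obs | state);
    trans[s][t] = P(t | s); start[s] = initial probability of s.
    Returns the most likely state path and its probability."""
    # paths[s] = (best path ending in state s, its probability)
    paths = {s: ([s], start[s] * frames[0][s]) for s in start}
    for frame in frames[1:]:
        new_paths = {}
        for cur in trans:
            # Best predecessor for the current state.
            prev = max(paths, key=lambda s: paths[s][1] * trans[s][cur])
            path, score = paths[prev]
            new_paths[cur] = (path + [cur], score * trans[prev][cur] * frame[cur])
        paths = new_paths
    best = max(paths, key=lambda s: paths[s][1])
    return paths[best]

# Toy two-state example: the observations favor "a" early and "b" late.
start = {"a": 0.6, "b": 0.4}
trans = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
frames = [{"a": 0.9, "b": 0.1}, {"a": 0.8, "b": 0.2}, {"a": 0.1, "b": 0.9}]
path, prob = viterbi(frames, trans, start)  # path is ["a", "a", "b"]
```

The open question is where a neural net slots into this picture — e.g. as the source of the per-frame emission probabilities.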

Thanks
~stratum~

Post

Which is exactly why neural networks are such a bad idea in decision making: you don't know why they came up with the conclusion :/

Post

I have found one possible integration scheme but it does not explain anything https://github.com/opendcd/opendcd.gith ... alkthrough

It's just a shell script that connects a neural net to a WFST-based decoder.

What it does is not clear to me: once we have a WFST, I do not see why one would need anything extra. To me it looks like opendcd could work alone, without any neural net. Are we supposed to use the neural net to derive a probabilistic assignment of feature vectors to the input alphabet symbols? What about the weights? https://en.wikipedia.org/wiki/Finite-st ... d_automata

Since a WFST itself is not a neural net, we apparently do not get those 'weights' from a neural net training algorithm, so they must come from something else (derived from a probabilistic language model, probably). Then apparently we are supposed to use a neural net to convert a feature vector into a symbol of the input alphabet (that's the only option left). Why isn't that done with, say, K-means clustering or a Gaussian mixture model? What would be the advantage of a neural net, or specifically a "deep" one?
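If that reading is right, the net's job is exactly the probabilistic assignment: map each frame's feature vector to a posterior over the input-alphabet symbols, which the decoder then consumes as arc costs. A hypothetical sketch — the phoneme set, the weights, and the single linear layer are all invented; a real hybrid system uses a deep net trained on aligned data in place of a GMM:

```python
import math

PHONEMES = ["sil", "ah", "k"]  # hypothetical input alphabet of the WFST

def softmax(zs):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def acoustic_posteriors(features, weights, biases):
    """One linear layer + softmax: the role the net plays in a hybrid system,
    estimating P(symbol | feature vector) for every frame."""
    zs = [sum(w * f for w, f in zip(row, features)) + b
          for row, b in zip(weights, biases)]
    return dict(zip(PHONEMES, softmax(zs)))

def arc_costs(posteriors):
    """Decoders typically work in the log domain, using -log posteriors as costs."""
    return {sym: -math.log(p) for sym, p in posteriors.items()}
```

Under this view the WFST still carries the lexicon and language-model weights; the net only supplies the per-frame symbol scores.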
~stratum~

Post

incredible stuff these guys can do...

Post

incredible stuff these guys can do...
It's a problem complicated by the size of the required training data (manual labelling is infeasible, as it is labor intensive), the size of the search space during decoding (combinatorial explosion), and the variability of the so-called 'basic' phonemes in context (they are influenced by the neighboring phonemes). A combination of these made it one of the most costly research fields ever :)
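The combinatorial explosion is usually tamed with beam pruning: instead of scoring every possible path, keep only the best few partial hypotheses after each frame. A minimal sketch; the symbol set and frame scores in the example are made up:

```python
import heapq

def beam_search(frames, symbols, beam_width=3):
    """frames: per-frame dict symbol -> log-likelihood.
    Exhaustive search scores |symbols| ** len(frames) paths; beam pruning
    keeps only the top `beam_width` partial paths after each frame."""
    beam = [((), 0.0)]  # (partial path, accumulated log score)
    for frame in frames:
        candidates = [(path + (s,), score + frame[s])
                      for path, score in beam for s in symbols]
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    best_path, best_score = beam[0]
    return list(best_path), best_score
```

With beam widths in the thousands this is what keeps large-vocabulary decoding tractable, at the cost of occasionally pruning the true best path.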
~stratum~

Post

Miles1981 wrote:Which is exactly why neural networks are such a bad idea in decision making: you don't know why they came up with the conclusion :/
But isn't that the same case for humans ;)

Post

highkoo wrote:So I guess our inevitable AI robot overlords won't have to sound so cold and detached, at least. :)
They won't have to... but they will. Just to intimidate us.

Post

Not really. There is a small fraction of the population that thinks like that, but the enterprise world is filled with the kind that requires the reasons behind a decision (65%, with all the managers, against 35%, with all the geniuses/wizards...). So the small fraction has to come up with a way of decoding their thought process, so that decisions can be made on a basis the managers/markets/... can understand.
Neural networks don't allow that process.

Post

That reasoning aside, to my 'uneducated' guessing ability it looks like there is no way to make a guided, non-exhaustive search through a very large search space using a neural network, so the actual speech decoder cannot be a neural net. It likely just replaces a small part of the whole system, and within that 'small' part the behavior, advantages, or disadvantages of a neural net can probably be explained.

It would be very interesting to see how that search was made, if I'm wrong.
~stratum~

Post

Miles1981 wrote:Not really. There is a small fraction of the population that thinks like that, but the enterprise world is filled with the kind that requires the reasons behind a decision (65%, with all the managers, against 35%, with all the geniuses/wizards...). So the small fraction has to come up with a way of decoding their thought process, so that decisions can be made on a basis the managers/markets/... can understand.
Neural networks don't allow that process.
In that respect, sure. I thought you just meant not knowing how the neural net actually works.

Though that's not to say that AI couldn't one day offer a reason too.

Post

The reasons can be given by other machine learning tools. The thing is, they also require more experience to use than raw CPU/GPU power :p

Post

Some cool people trying to implement the wavenet algorithm here:
https://github.com/ibab/tensorflow-wavenet
Looks quite promising.
