stratum wrote: I do not know about deep-learning-based speech recognition, but more classical approaches use feature descriptors based on the mel-frequency cepstrum. That's something like an FFT of the log of an FFT of the speech, aligned to human ear sensitivities (not exactly, as the second step is a DCT). More info can be found here:
https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
https://en.wikipedia.org/wiki/Cepstrum

Yes, that's the input they use for their neural networks. The teams using Deep Learning for speech recognition do not yet use unprocessed audio samples, but a smaller set of descriptors based on an FFT.
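As a rough sketch of the cepstral pipeline described above (windowed FFT, log, then DCT; a real MFCC implementation would insert a mel filterbank between the FFT and the log), here is a toy version in numpy, with the test signal invented for illustration:

```python
import numpy as np

def dct2(x):
    # DCT-II by direct cosine sum; fine for short feature vectors
    n = len(x)
    k = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * m / n))
                     for m in range(n)])

def cepstral_features(frame, n_coeffs=13):
    # windowed FFT -> log magnitude spectrum -> DCT, keep low-order coefficients
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    log_spec = np.log(spectrum + 1e-10)  # small offset avoids log(0)
    return dct2(log_spec)[:n_coeffs]

# 25 ms of a 440 Hz tone at 16 kHz as a stand-in for a speech frame
frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
coeffs = cepstral_features(frame)
print(coeffs.shape)  # (13,)
```

Thirteen or so low-order coefficients per frame is the typical size of the descriptor vector that replaces the raw samples.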
Google WaveNet sound and speech synthesis
Guillaume Piolat
- KVRist
- 279 posts since 21 Sep, 2015 from Grenoble
Check out our VST3/VST2/AU/AAX/LV2:
Inner Pitch | Lens | Couture | Panagement | Graillon
- KVRian
- 872 posts since 6 Aug, 2005 from England
There are a few ways to tag speech. There also needs to be some context-based learning for natural pauses and pitch changes; something like Markov chains may be used for that. All in all, it's a massive undertaking in reading analysis.
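A Markov chain over prosody tags could look something like this toy sketch (states and transition probabilities entirely made up):

```python
import random

# Toy Markov chain over prosody tags, as a sketch of context-based
# modelling of pauses and pitch changes. All values are invented.
transitions = {
    "word":     {"word": 0.70, "pause": 0.20, "pitch_up": 0.10},
    "pause":    {"word": 0.90, "pause": 0.05, "pitch_up": 0.05},
    "pitch_up": {"word": 0.80, "pause": 0.20},
}

def next_state(state, rng):
    """Sample the next tag from the current tag's transition row."""
    r, cum = rng.random(), 0.0
    for s, p in transitions[state].items():
        cum += p
        if r < cum:
            return s
    return s  # guard against floating-point round-off

rng = random.Random(0)
seq = ["word"]
for _ in range(10):
    seq.append(next_state(seq[-1], rng))
print(seq)
```

A real system would condition on much richer context than the previous tag alone, but the idea of sampling pauses and pitch moves from learned transition statistics is the same.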
Incidentally, Microsoft recently released its deep learning suite as open source:
http://blogs.microsoft.com/next/2016/01 ... 11069z01c8
Dave H.
Dave Hoskins. http://www.quikquak.com
- KVRAF
- 2256 posts since 29 May, 2012
Dave H. wrote: Incidentally, Microsoft recently released its deep learning suite as open source:
http://blogs.microsoft.com/next/2016/01 ... 11069z01c8

It's interesting that many are following the same approach once it is proven to work.
Do you know of an introductory paper that describes how these neural nets are integrated into a decoder? It's not difficult to see how an HMM-based Viterbi search would work, but there are many ways a neural network could be used, and it's not obvious why one would be preferred over another.
Thanks
~stratum~
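For reference, the HMM-based Viterbi search mentioned above can be sketched in a few lines; the two-state model values below are invented purely for illustration:

```python
import numpy as np

# Minimal Viterbi decode over a toy two-state HMM. All values invented.
def viterbi(obs, log_init, log_trans, log_emit):
    delta = log_init + log_emit[:, obs[0]]   # best log score ending in each state
    backptr = []
    for o in obs[1:]:
        scores = delta[:, None] + log_trans  # scores[i, j]: from state i to state j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_emit[:, o]
    path = [int(delta.argmax())]             # backtrack from the best final state
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return path[::-1]

log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])  # row: from-state, column: to-state
log_emit = np.log([[0.9, 0.1], [0.2, 0.8]])   # row: state, column: observed symbol
print(viterbi([0, 0, 1, 1], log_init, log_trans, log_emit))  # [0, 0, 1, 1]
```

The open design question in the thread is exactly where a neural network plugs into this: it can replace the emission scores, the whole search, or something in between.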
- KVRian
- 1379 posts since 26 Apr, 2004 from UK
Which is exactly why neural networks are such a bad idea in decision making: you don't know why they came up with the conclusion :/
- KVRAF
- 2256 posts since 29 May, 2012
I have found one possible integration scheme, but it does not explain anything: https://github.com/opendcd/opendcd.gith ... alkthrough
It's just a shell script that connects a neural net to a WFST-based decoder.
What it does is not clear to me; once we have a WFST, I do not see why one would need anything extra. To me it looks like OpenDcd could work alone without needing any neural net. Are we supposed to use the neural net to derive a probabilistic assignment of feature vectors to the input alphabet symbols? What about the weights? https://en.wikipedia.org/wiki/Finite-st ... d_automata Since a WFST itself is not a neural net, apparently we do not get those 'weights' from a neural-net training algorithm, so they come from something else (derived from a probabilistic language model, probably). Then apparently we are supposed to use a neural net to convert a feature vector to a symbol of the input alphabet (that's the only option left). Why isn't that done with, say, k-means clustering or a Gaussian mixture model? What would be the advantage of a neural net, or a "deep" one specifically?
~stratum~
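For what it's worth, the usual "hybrid" answer to the question above is that the network supplies per-frame phone posteriors, which get converted to scaled likelihoods for the WFST/HMM decoder to use as arc scores in place of GMM likelihoods. A rough numpy sketch, with all numbers invented:

```python
import numpy as np

# Hybrid NN/HMM scoring sketch: the network outputs per-frame phone
# posteriors P(phone | features); dividing by the phone priors P(phone)
# gives scaled likelihoods a WFST/HMM decoder can consume. Values invented.
def scaled_log_likelihoods(logits, log_priors):
    log_post = logits - np.log(np.sum(np.exp(logits)))  # log softmax
    return log_post - log_priors  # Bayes: p(x|s) proportional to p(s|x) / p(s)

logits = np.array([2.0, 0.5, -1.0])    # hypothetical network output, one frame
log_priors = np.log([0.5, 0.3, 0.2])   # hypothetical phone priors
print(scaled_log_likelihoods(logits, log_priors))
```

So the WFST structure and its language-model weights stay as they are; the network only replaces the per-frame acoustic scorer, which is why the advantage over a GMM is purely in how well it models P(phone | features).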
- KVRian
- 1045 posts since 3 Jul, 2006
incredible stuff these guys can do...
- KVRAF
- 2256 posts since 29 May, 2012
It's a problem complicated by the size of the required training data (manual labelling is labour-intensive and thus infeasible), the size of the search space during decoding (combinatorial explosion), and the variability of the so-called 'basic' phonemes within context (they are influenced by neighbouring phonemes). A combination of these made it one of the most costly research fields ever :)
~stratum~
- KVRist
- 168 posts since 19 Apr, 2014 from London
Miles1981 wrote: Which is exactly why neural networks are such a bad idea in decision making: you don't know why they came up with the conclusion :/

But isn't that the same for humans?
- KVRAF
- 4656 posts since 1 Aug, 2005 from Warszawa, Poland
highkoo wrote: So I guess our inevitable AI robot overlords won't have to sound so cold and detached, at least.

They won't have to... but they will. Just to intimidate us.
- KVRian
- 1379 posts since 26 Apr, 2004 from UK
Not really. There is a small fraction of the population that thinks like that, but the enterprise world is filled with the kind that requires the reasons behind a decision (65%, with all the managers, against 35%, with all the geniuses/wizards...). So that small fraction has to come up with a way of decoding its thought process, to justify the decision on a basis that the managers/markets/... can understand.
As such, neural networks don't allow that process.
- KVRAF
- 2256 posts since 29 May, 2012
That reasoning aside, to my 'uneducated' guessing ability it looks like there is no way to make a guided, non-exhaustive search through a very large search space using a neural network alone, so the actual speech decoder cannot be a neural net. It likely just replaces a small part of the whole system, and within that 'small' part the behaviour, advantages, or disadvantages of a neural net can probably be explained.
It would be very interesting to see how that search was made, if I'm wrong.
~stratum~
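The guided, non-exhaustive search stratum describes is typically a beam search: some scorer (which could be a neural net) rates each extension of a partial hypothesis, and only the best few hypotheses survive each step. A toy sketch with an invented scorer:

```python
import math

# Toy beam search. The scorer is made up; in a real decoder it would
# combine acoustic and language-model scores, possibly from a neural net.
def beam_search(score_step, vocab, steps, beam_width=2):
    beams = [((), 0.0)]  # (partial sequence, cumulative log score)
    for _ in range(steps):
        candidates = [(seq + (sym,), score + score_step(seq, sym))
                      for seq, score in beams for sym in vocab]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune: keep only the best few
    return beams[0][0]

def score_step(seq, sym):
    # toy scorer that strongly prefers alternating symbols
    return math.log(0.9) if (not seq or seq[-1] != sym) else math.log(0.1)

print(beam_search(score_step, ["a", "b"], steps=4))
```

The pruning is what keeps the search non-exhaustive: the number of live hypotheses stays at `beam_width` instead of growing as `len(vocab) ** steps`.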
- KVRist
- 168 posts since 19 Apr, 2014 from London
Miles1981 wrote: Not really. There is a small fraction of the population that thinks like that, but the enterprise world is filled with the kind that requires the reasons that made the decision... As such, neural networks don't allow that process.

In that respect, sure. I thought you just meant not knowing how the neural net actually worked.
Though that's not to say that AI couldn't one day offer a reason too.
- KVRian
- 1379 posts since 26 Apr, 2004 from UK
The reason can be given by other machine-learning tools. The thing is, they also require more experience to use than raw CPU/GPU power :p
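As a toy illustration of such a tool: a one-feature decision stump can state the exact rule behind its decision, which a neural net cannot. Data and threshold below are invented:

```python
# Toy sketch of an interpretable model: a one-feature decision stump.
# Unlike a neural net, its "reason" is the learned rule itself.
def fit_stump(xs, ys):
    """Pick the threshold t maximising accuracy of the rule (x >= t)."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = sum((x >= t) == y for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # invented feature values
ys = [False, False, False, True, True, True]
t = fit_stump(xs, ys)
print(f"predict True because x >= {t}")  # the explanation managers want
```

Decision trees, rule lists, and linear models all generalise this: the model is its own explanation, at the cost of needing more modelling skill than just throwing compute at a deep net.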
- KVRist
- 48 posts since 1 Jan, 2006 from germany
Some cool people are trying to implement the WaveNet algorithm here:
https://github.com/ibab/tensorflow-wavenet
Looks quite promising.
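For the curious, the core building block that repo implements is WaveNet's causal convolution with dilation; stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially with depth. A toy numpy sketch of one such layer, with invented filter values:

```python
import numpy as np

# One causal dilated convolution layer, WaveNet-style: output[t] only
# ever depends on input up to time t. Filter values are made up.
def causal_dilated_conv(x, w, dilation):
    pad = (len(w) - 1) * dilation           # left-pad so the filter is causal
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[k] * xp[t + pad - k * dilation]
                         for k in range(len(w)))
                     for t in range(len(x))])

x = np.arange(8, dtype=float)
y = causal_dilated_conv(x, w=[0.5, 0.5], dilation=2)  # mixes x[t] with x[t-2]
print(y)
```

The real network adds gated activations, residual connections, and a softmax over quantised sample values on top of this, but the causal dilated filter is what lets it model raw audio sample by sample.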