Google released its first neural network-powered voice recognition system in 2011, and the technology has been slowly improving ever since. Now the company has announced that the addition of recurrent neural networks will make the tech much faster and much more accurate.
In a blog post, the Google Speech Team explains that it’s added what are known as Connectionist Temporal Classification and sequence discriminative training techniques to its algorithms. If that doesn’t make much sense to you, here’s a straightforward explanation of how it works:
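To give a flavour of what Connectionist Temporal Classification does, here is a minimal sketch (not Google's code): the network emits one label per audio frame, including a special "blank" symbol, and the decoder collapses repeated labels and drops blanks to recover the transcript. The blank symbol and the example frames are illustrative assumptions.

```python
BLANK = "-"  # hypothetical blank symbol used only for this sketch

def ctc_collapse(frame_labels):
    """Collapse repeated labels between frames, then drop blanks."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev:        # keep a label only when it changes
            collapsed.append(label)
        prev = label
    return [l for l in collapsed if l != BLANK]

# Per-frame outputs for the start of "museum": repeats and blanks vanish,
# so the model never has to say exactly where one sound ends.
frames = ["m", "m", "-", "j", "u", "u", "-", "z", "z"]
print("".join(ctc_collapse(frames)))  # prints "mjuz"
```

The point of the blank is exactly the one the Google team makes below: the recogniser doesn't need to pin down where /j/ ends and /u/ begins, only that both sounds occurred in order.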
In a traditional speech recogniser, the waveform spoken by a user is split into small consecutive slices or “frames” of 10 milliseconds of audio. Each frame is analysed for its frequency content, and the resulting feature vector is passed through an acoustic model... The recogniser then reconciles all this information to determine the sentence the user is speaking. If the user speaks the word “museum” for example - /m j u z i @ m/ in phonetic notation - it may be hard to tell where the /j/ sound ends and where the /u/ starts, but in truth the recogniser doesn’t care where exactly that transition happens: All it cares about is that these sounds were spoken.
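The framing step described above can be sketched in a few lines. The 16 kHz sample rate is an assumption for illustration (the 10 ms frame length comes from the quote); a real recogniser would also compute frequency features per frame, which is omitted here.

```python
SAMPLE_RATE = 16000                          # samples per second (assumed)
FRAME_MS = 10                                # frame length from the quote above
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 160 samples per 10 ms frame

def split_into_frames(waveform):
    """Split raw samples into consecutive, non-overlapping 10 ms frames."""
    return [waveform[i:i + FRAME_LEN]
            for i in range(0, len(waveform) - FRAME_LEN + 1, FRAME_LEN)]

# One second of audio (silence, for simplicity) -> 100 frames of 160 samples.
frames = split_into_frames([0.0] * SAMPLE_RATE)
print(len(frames), len(frames[0]))  # prints "100 160"
```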
Our improved acoustic models rely on Recurrent Neural Networks (RNN). RNNs have feedback loops in their topology, allowing them to model temporal dependencies: when the user speaks /u/ in the previous example, their articulatory apparatus is coming from a /j/ sound and from an /m/ sound before. Try saying it out loud - “museum” - it flows very naturally in one breath, and RNNs can capture that.
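The feedback loop the team describes can be shown with a toy recurrent step (nothing like Google's actual model): the hidden state computed for one frame is fed back in with the next frame, so the output for /u/ can depend on the /j/ and /m/ that came before. The scalar weights are arbitrary numbers chosen purely for illustration.

```python
import math

W_IN, W_REC = 0.5, 0.8   # input and recurrent weights (assumed, illustrative)

def rnn_step(x, h_prev):
    """One recurrent step: mix the current frame with the previous state."""
    return math.tanh(W_IN * x + W_REC * h_prev)

def run_rnn(frames):
    h = 0.0              # initial hidden state
    states = []
    for x in frames:     # h carries context forward from earlier frames
        h = rnn_step(x, h)
        states.append(h)
    return states

# The same input value yields a different state at each step, because the
# state remembers what came before -- the history a purely frame-by-frame
# classifier would throw away.
print(run_rnn([1.0, 1.0, 1.0]))
```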
By introducing that ability to take into account the sounds on either side of each snippet, the algorithm stands a far better chance of understanding what you say. In fact, Google claims that it makes voice search far more accurate, particularly in noisy environments, as well as helping to make it "blazingly fast".