Invented by Alexander H. Gruenstein, Google LLC
The Google LLC invention works as follows. The patent describes “methods, systems and apparatus including computer programs on computer storage media for training a neural network.” The method includes generating a number of feature vectors, each of which models a different portion of an audio waveform. It then uses a neural network to generate a posterior probability vector for the first feature vector.
Background for Training Multiple Neural Networks with Different Accuracy
Automatic speech recognition is one technology used in mobile devices. A common goal for this technology is to let users interact with their devices using voice commands. It may be useful to recognize a “hotword” that signals that the mobile device should activate when it is in a sleep state.
The subject matter of this specification includes a number of innovative aspects. One of these can be embodied in methods that include receiving a digital representation of speech; generating multiple feature vectors, each of which models a portion of an audio waveform from the digital representation of speech over a period of time different from that of the other feature vectors; generating a first posterior probability vector for the first feature vector using a neural network; and determining whether one of the scores in that vector satisfies a threshold value. This aspect can also be embodied in computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. A system of one or more computers can be configured to perform particular operations or actions by having software, firmware, hardware, or a combination of them installed on the system. One or more computer programs can be configured to perform particular operations or actions by including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
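The feature-vector step described above can be sketched as follows. This is only an illustration under stated assumptions: the frame length, hop size, and the two toy features (log energy and zero-crossing rate) are hypothetical choices, not values taken from the patent, which does not specify the front end in this summary.

```python
import math

def frame_features(samples, frame_len=400, hop=160):
    """Split a waveform into overlapping frames and build one toy
    feature vector per frame. frame_len and hop are assumed values
    (25 ms and 10 ms at 16 kHz), not taken from the patent."""
    vectors = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Mean energy of the frame, stored on a log scale.
        energy = sum(x * x for x in frame) / frame_len
        # Fraction of adjacent sample pairs that change sign.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        vectors.append([math.log(energy + 1e-10), crossings / frame_len])
    return vectors

# One second of a 440 Hz tone sampled at 16 kHz stands in for speech.
samples = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
features = frame_features(samples)
```

Each entry of `features` models a different time window of the waveform, which is the property the method relies on; a real front end would likely use richer spectral features.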
The foregoing embodiments and others can optionally include any or all of the following features, alone or in combination. The method can include storing the feature vector used by the first neural network in a memory and providing it to the second neural network when it is determined that one of the scores in the posterior probability vector satisfies the threshold value. In response to determining that the scores at the same index location in a series of fourth posterior probability vectors satisfy the second threshold value, the method can include performing an action. The method can include generating, using the neural network, a second posterior probability vector for each feature vector and determining whether one of the scores in the second posterior probability vector satisfies the threshold value. This continues until the first posterior handling module determines that none of the scores in a particular third posterior probability vector satisfies the threshold value. The second neural network may generate the third posterior probability vector, and the second posterior handling module can determine whether one of the scores in the third posterior probability vector satisfies the threshold value. This may be done for each feature vector until the first posterior handling module determines that no scores satisfy the threshold value.
In some implementations, the second neural network receives the successive feature vectors either from a front-end feature extraction module or from the first neural network. The method can include identifying a predetermined clock frequency for a processor to use when generating the first posterior probability vector for the feature vector using the first neural network. The processor can be a digital signal processor. The first neural network can identify the predetermined clock frequency.
In some implementations, the false positive rate of the first neural network is higher than that of the second neural network. The first posterior handling module and the second posterior handling module may be the same posterior handling module. The first threshold value and the second threshold value can be decimal values between zero and one. The first threshold may be lower than the second threshold. The first threshold may be 0.1, and the second threshold may be 0. The second neural network may be more accurate than the first neural network.
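A minimal sketch of the two-stage check described above, assuming each network is a callable that maps a feature vector to a posterior probability vector of key-phrase scores. The stand-in networks, the sample inputs, and the second threshold of 0.8 are hypothetical; only the first threshold of 0.1 comes from the text.

```python
def cascade(feature_vectors, coarse_net, deep_net,
            first_threshold, second_threshold):
    """Run the coarse network on every feature vector and the deep
    network only on vectors whose coarse scores satisfy the first
    threshold. Returns (indices detected by the deep network,
    number of deep-network evaluations)."""
    detections = []
    deep_calls = 0
    for i, fv in enumerate(feature_vectors):
        coarse_posteriors = coarse_net(fv)
        if max(coarse_posteriors) < first_threshold:
            continue  # no score satisfies the first threshold
        deep_calls += 1
        deep_posteriors = deep_net(fv)
        if max(deep_posteriors) >= second_threshold:
            detections.append(i)
    return detections, deep_calls

# Hypothetical stand-ins: the coarse network over-estimates scores,
# which gives it the higher false positive rate.
coarse_net = lambda fv: [min(1.0, fv[0] * 1.2)]
deep_net = lambda fv: [fv[0]]

inputs = [[0.9], [0.5], [0.02]]
detections, deep_calls = cascade(inputs, coarse_net, deep_net,
                                 first_threshold=0.1,
                                 second_threshold=0.8)
```

In this toy run the coarse network forwards the first two inputs, the deep network confirms only the first, and the third input never reaches the deep network. A low first threshold keeps the coarse stage from discarding true matches, at the cost of more false positives for the second stage to reject.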
The subject matter described herein can also be embodied in methods that include the actions of training a first neural network, which has a first quantity of nodes, to identify a set of features using a first training set; training a second neural network, which has a greater quantity of nodes than the first neural network, to identify the same set of features using a second training set; and providing the first and second neural networks to a user device that uses them to analyze a data set and determine whether it includes a digital representation of a feature from the set. This aspect can also be embodied in computer systems, apparatus, and computer programs recorded on one or more computer storage devices. A system of one or more computers can be configured to perform particular operations or actions by having software, firmware, hardware, or a combination of them installed on the system. One or more computer programs can be configured to perform particular operations or actions by including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
The following features can be included in the embodiments described above, either individually or in combination. The set of features can include key phrases and key words. The user device can use the first and second neural networks to analyze audio waveforms and determine whether a digital representation of any of the key words or key phrases in the set of features is present. The method can include providing the user device with a feature extraction module, a first posterior handling module, and a second posterior handling module. The user device then uses the feature extraction module, the first posterior handling module, and the second posterior handling module to analyze the data set. The first posterior handling module and the second posterior handling module may be the same posterior handling module.
In some implementations, the set of features includes computer vision, handwriting recognition, text classification, or authentication features. The first and second training sets may be the same training set. The method can include training the first neural network for a first quantity of iterations and training the second neural network for a second quantity of iterations that is greater than the first. The ratio between the number of nodes in the first neural network and the number in the second can be used to identify the performance cost savings that the user device achieves when it analyzes certain portions of the data with the first neural network and not the second.
The subject matter described herein can be implemented in particular embodiments to realize one or more advantages. In some implementations, using a coarse neural network to analyze feature vectors initially, and then a deeper neural network to analyze only the feature vectors that the coarse neural network has determined meet a threshold of relevance, can reduce CPU usage, power consumption, and/or bandwidth usage.
The drawings and description below provide details on one or more embodiments. The description, drawings and claims will reveal other features, aspects and advantages.
A coarse deep neural network is trained using a set of training feature vectors that model audio waveforms of words or sub-word units. This allows the network to determine the probability that a feature vector corresponds to a key phrase represented by those words or sub-word units. Another deep neural network with more hidden nodes is trained using the same set of training feature vectors. It likewise identifies the probability that a feature vector corresponds to a particular key phrase represented by the words or sub-word units.
The other deep neural network is more accurate than the coarse deep neural network because it has more nodes. It can also be trained for more iterations. The coarse deep neural network, for example, may have a higher false positive rate.
A system can use the coarse deep neural network for an initial analysis of feature vectors, which reduces how often the other deep neural network is needed. Only feature vectors that the coarse deep neural network identifies as relevant are analyzed by the other, more accurate deep neural network.
In the illustrated example, the feature extraction module 102 of the speech recognition system 100 receives three digital representations of speech at different times, “Okay Google,” “Okay Froogle,” and “Okay John,” and extracts features from each digital representation to produce feature vectors.
The speech recognition system 100 analyzes the feature vectors using the coarse deep neural network 104 to determine whether the digital representations include phrases that are similar to the key phrase. Because the coarse deep neural network 104 is not as accurate as the deep neural network 108, it may produce posterior probability vectors indicating that the feature vectors for both “Okay Google” and “Okay Froogle” have a high probability of corresponding with the key phrase “Okay Google.” Another posterior probability vector indicates that “Okay John” has a low probability of corresponding with the key phrase.
The speech recognition system 100 sends the posterior probability vectors generated by the coarse deep neural network 104 to a posterior handling module 106. This module uses the values in the posterior probability vectors to determine whether a digital representation of speech includes any of the key phrases for which the coarse deep neural network 104 is trained. In this example, the posterior handling module 106 determines that input A, “Okay Google,” and input B, “Okay Froogle,” have high probabilities of including the key phrase “Okay Google,” for which the coarse deep neural network 104 was trained. It also determines that input C, “Okay John,” has a low probability of including the key phrase and is not similar to any other key phrase for which the coarse deep neural network 104 was trained.
Because the deep neural network 108 is more accurate, it generates second posterior probability vectors that represent a high probability that the feature vectors for input A, “Okay Google,” correspond with the key phrase “Okay Google,” and third posterior probability vectors that represent a low probability that the feature vectors for input B, “Okay Froogle,” correspond with the key phrase.
The deep neural network 108 sends the second and third posterior probability vectors to another posterior handling module 110. This module analyzes the values in those vectors to determine whether the corresponding digital representations of speech include the key phrase. When determining whether a digital representation of speech contains a key phrase, the posterior handling module 110 may analyze sequential posterior probability vectors from the deep neural network 108.
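One way a posterior handling module might analyze sequential posterior probability vectors is to look for an index location whose score satisfies the threshold across several consecutive vectors. The sketch below assumes that reading; the function name, the run length, and the sample scores are hypothetical, and the patent's actual handling logic may differ.

```python
def index_meeting_threshold(posterior_vectors, threshold, run_length):
    """Return the first index location whose score satisfies the
    threshold in run_length consecutive posterior probability
    vectors, or None if no index location does."""
    if not posterior_vectors:
        return None
    # streak[i] counts consecutive vectors in which index i qualified.
    streak = [0] * len(posterior_vectors[0])
    for vector in posterior_vectors:
        for i, score in enumerate(vector):
            streak[i] = streak[i] + 1 if score >= threshold else 0
            if streak[i] >= run_length:
                return i
    return None

# Index 1 stays above the threshold for three vectors in a row.
steady = [[0.1, 0.9], [0.2, 0.85], [0.1, 0.95]]
# No index stays above the threshold for two vectors in a row.
flicker = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]

hit = index_meeting_threshold(steady, threshold=0.8, run_length=3)
miss = index_meeting_threshold(flicker, threshold=0.8, run_length=2)
```

Requiring agreement across consecutive vectors makes a single spurious high score less likely to trigger an action than a per-vector check would.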
Using the coarse deep neural network 104 for a first-pass analysis of the feature vectors reduces the amount of analysis performed by the deep neural network 108. The coarse deep neural network 104 may, for example, determine that approximately 2% of the feature vectors correspond to audio with a high probability of containing a key phrase. The deep neural network 108 then needs to analyze only those feature vectors rather than all of them. The coarse deep neural network 104 has fewer nodes than the deep neural network 108, so it requires less computation, such as less CPU usage.
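The savings can be estimated with simple arithmetic. The sketch below assumes that the per-vector cost of a network is roughly proportional to its node count, which is a simplification, and uses hypothetical node counts of 256 and 1024; only the 2% forwarding rate comes from the example above.

```python
def relative_cascade_cost(coarse_nodes, deep_nodes, forwarded_fraction):
    """Estimate the cost of the cascade relative to running the deep
    network on every feature vector. Assumes per-vector cost scales
    with node count (a rough approximation)."""
    node_ratio = coarse_nodes / deep_nodes
    # Every vector pays the coarse cost; only the forwarded fraction
    # also pays the full deep-network cost.
    return node_ratio + forwarded_fraction

cost = relative_cascade_cost(coarse_nodes=256, deep_nodes=1024,
                             forwarded_fraction=0.02)
savings = 1.0 - cost
print(f"relative cost: {cost:.2f}, estimated savings: {savings:.0%}")
```

With these assumed numbers the cascade costs about 0.27 of a deep-only pipeline. The estimate is dominated by the node ratio, so a smaller coarse network saves more as long as its forwarding rate stays low.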
When the coarse deep neural network 104 runs on a different device from the one that executes the deep neural network 108, network bandwidth usage may also be reduced. Only the feature vectors that the coarse deep neural network 104 identifies as having a high probability of corresponding with a key phrase are sent to the remote device that executes the deep neural network 108 for further analysis.
A user device can use the coarse deep neural network 104 and the deep neural network 108 to analyze received audio waveforms and determine whether a sequence of frames in an audio waveform includes a digital representation of one of the specific key words or key phrases on which both deep neural networks were trained. The user device can perform an action when the deep neural network 108 determines that a sequence of frames contains a digital representation of a specific key word or key phrase, or that the probability that it does so satisfies a threshold probability. The user device could, for example, exit a standby state, launch an application, or perform another action.