Invented by Ozlem Kalinli-Akbacak, Sony Interactive Entertainment Inc
The Sony Interactive Entertainment Inc. invention works as follows. A speech-recognition system includes a boundary classifier and a phone classifier. The phone classifier generates combined boundary posteriors using a combination of auditory attention features, phone posteriors, and a machine-learning algorithm that classifies phone posterior context. The combined boundary posteriors are used by the boundary classifier to estimate boundaries of speech in the audio signal.
Background for "A speech recognition system that uses machine learning to classify the phone's posterior context and estimate boundaries from combined boundary posteriors"
Segmenting continuous speech into phonemes can be beneficial for many applications, including speech analysis, automatic speech recognition (ASR), and speech synthesis. Manually determining phonetic transcripts and segmentations, for example, requires expert knowledge and is expensive and time-consuming for large databases. Many automatic segmentation and labeling techniques have been proposed to solve this problem.
Proposed methods include S. Dusan and L. Rabiner, "On the relationship between maximum spectral shift positions and phone boundaries," in Proc. 2006 (hereinafter Reference [1]); Qiao, N. Shimomura, and N. Minematsu, "Unsupervised optimal phoneme segmentation: objectives, algorithm and comparisons," in Proc. 2008 (Reference [2]); F. Brugnara, D. Falavigna, and M. Omologo, "Automatic segmentation and labeling of speech based upon hidden Markov models," Speech Communication, vol. 12, no. 4, pp. 357-370, 1993 (Reference [3]); A. Sethy, S. S. Narayanan, and J. A. Y. Chen, "Refined speech segmentation using concatenative speech synthesis," in Proc. ICSLP 2002 (Reference [4]); and a fifth proposed method (Reference [5]).
These proposed methods correspond to References [1] through [5] cited in a paper entitled "Automatic Phoneme Segmentation Using Auditory Attention Features."
A first group of the proposed segmentation methods requires transcriptions, which are not always available. When a transcription is unavailable, a phoneme recognizer may be used for segmentation; however, HMMs, which are designed to identify the phone sequence correctly, cannot accurately place phone boundaries (see the references above). Another group of methods does not require prior knowledge of the transcription or of acoustic phoneme models, but their performance is usually limited.
It is within this context that the present disclosure arises.
Although this detailed description contains many specific details for purposes of illustration, a person of ordinary skill in the art will appreciate that many variations and alterations of those details are possible. The exemplary embodiments are described without loss of generality to, and without imposing limitations upon, the claimed invention.
Boundary-detection methods using auditory attention features have previously been proposed. Phoneme posteriors can be used together with auditory attention features to improve boundary-detection accuracy. Phoneme posteriors can be obtained by training a model (for example, a deep neural network) that estimates phoneme class posterior scores given acoustic features (e.g., MFCCs, mel filterbank features, etc.). This information is very helpful for boundary detection, and it is suggested that boundary detection performance can be improved by combining auditory attention features with phoneme posteriors. This can be done using the phoneme posteriors of the current frame; context information from neighboring frames can further improve performance.
In the present disclosure, a new segmentation method is proposed that combines phone posteriors with auditory attention features. The algorithm is accurate and does not require a transcription.
Such a combination is described in a commonly-assigned patent application, Ser. No. ______, the entire contents of which are incorporated herein by reference. Phoneme posteriors can be combined with auditory attention features to improve boundary accuracy. Phoneme posteriors can be obtained by training a model that estimates phoneme class posterior scores given acoustic features (MFCCs, mel filterbank features, etc.). Around a boundary, the phoneme classification accuracy of such models drops because the posteriors become harder to distinguish, whereas in the middle of a phoneme segment there is a clear winner. This behavior is very helpful for boundary detection. It is therefore proposed here that boundary detection performance can be improved by combining auditory attention features with phoneme posteriors. This can be done using the phoneme posteriors of a given frame; context information from neighboring frames can further improve performance.
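As a concrete illustration of combining the two feature streams, the sketch below concatenates per-frame auditory gist features with phoneme posteriors from the current and neighboring frames; the resulting vectors would then be fed to a boundary classifier. This is a minimal sketch that assumes both feature streams have already been computed per frame; the function names and dimensions are illustrative, not taken from the patent.

```python
import numpy as np

def stack_context(frames, k):
    """Stack each frame with its k left and k right neighbors
    (edges padded by repetition) to add context information."""
    T, D = frames.shape
    padded = np.vstack([np.repeat(frames[:1], k, axis=0),
                        frames,
                        np.repeat(frames[-1:], k, axis=0)])
    return np.hstack([padded[i:i + T] for i in range(2 * k + 1)])

def combine_features(gist, phone_post, k=2):
    """Concatenate auditory attention (gist) features with phoneme
    posteriors from the current and 2k neighboring frames."""
    return np.hstack([gist, stack_context(phone_post, k)])

# toy example: 100 frames, 30-dim gist, 40 phoneme classes
gist = np.random.rand(100, 30)
post = np.random.rand(100, 40)
X = combine_features(gist, post, k=2)   # shape (100, 30 + 5*40)
```

Each row of `X` pairs one frame's attention features with a five-frame window of posteriors, so a downstream boundary classifier can exploit the posterior "confusion" pattern that appears around boundaries.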
In certain aspects of the disclosure, a signal corresponding to recorded audio can be analyzed to determine boundaries, such as phoneme boundaries. This boundary detection can be achieved by extracting auditory attention features and phoneme posteriors and combining them to detect boundaries in the signal. The remainder of the present disclosure is organized as follows: first, extraction of auditory attention features is described; next, phone posterior extraction is described; then two approaches for combining phoneme posteriors and auditory attention features for boundary detection are discussed.
In the disclosure presented here, a novel phoneme segmentation method is proposed that utilizes auditory attention cues. The motivation for the proposed method, which is not limited to any particular theory of operation, is as follows. In a spectrum of speech, edges and discontinuities are usually visible around the phoneme boundaries. This is especially true around vowels, which have high formant energy. In a figure of the paper "Automatic Phoneme Segmentation Using Auditory Attention Features" mentioned earlier, the spectrum of a segment of speech transcribed as "his captain was" is shown along with the approximate phoneme boundaries. Some of the phoneme boundaries can be seen in the spectrum; for example, the boundaries around the vowels ih, ae, ix, etc. It is therefore believed that the auditory spectrum can be used to detect such edges and discontinuities, and thereby to locate phoneme segments or boundaries in speech, much as one can locate them visually.
Auditory Attention Features
Auditory attention cues are extracted in a manner inspired by the processing stages of the human auditory system. The sound spectrum is filtered using 2D spectro-temporal filters modeled on the stages of the central auditory system and then converted into low-level auditory features. The auditory attention model differs from previous work in the literature in that it analyzes the 2D spectrum like an image to detect edges and local temporal or spectral discontinuities, which mark boundaries in the speech.
The auditory attention model treats the spectrum as the analog of a visual image. Contrast features are extracted at multiple scales from the spectrum using 2D spectro-temporal receptive filters. The extracted features can be tuned to different local oriented edges: e.g., frequency contrast features can be tuned to local horizontally-oriented edges, which are useful for detecting formants and their changes. Low-level auditory gist features can then be extracted, and a neural network can be used to discover the relevant oriented edges and to learn the mapping between gist features and phoneme boundaries.
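A minimal sketch of one such 2D spectro-temporal filter follows: a difference-of-Gaussians profile along the frequency axis, extended in time by a Gaussian envelope, responds to local horizontally-oriented edges such as formants. The kernel shape and parameters are illustrative assumptions, not the patent's actual receptive filters.

```python
import numpy as np

def frequency_contrast_kernel(size=9, sigma=2.0):
    """Excitatory center flanked by inhibitory bands above and below
    along the frequency axis; responds to horizontal edges (formants)."""
    f = np.arange(size) - size // 2
    g = lambda mu: np.exp(-0.5 * ((f - mu) / sigma) ** 2)
    profile = g(0) - 0.5 * (g(3 * sigma) + g(-3 * sigma))
    t = np.arange(size) - size // 2
    envelope = np.exp(-0.5 * (t / (2 * sigma)) ** 2)   # temporal extent
    return np.outer(profile, envelope)

def filter_spectrum(spec, kernel):
    """Valid-mode 2D correlation of the auditory spectrum
    (freq x time) with a spectro-temporal kernel."""
    kf, kt = kernel.shape
    F, T = spec.shape
    out = np.zeros((F - kf + 1, T - kt + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(spec[i:i + kf, j:j + kt] * kernel)
    return out

spec = np.random.rand(128, 60)    # toy auditory spectrum
fmap = filter_spectrum(spec, frequency_contrast_kernel())
```

In a practical system the explicit loops would be replaced by an FFT-based convolution, but the receptive-field idea is the same.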
The following steps can be taken to extract auditory attention cues from a speech input signal. First, a spectrum can be calculated using an early auditory model or a fast Fourier transform (FFT). Multi-scale features modeled on the central auditory system can then be extracted. Next, center-surround differences can be computed by comparing finer and coarser scales. Auditory gist features can then be calculated by dividing each feature map into an m-by-n grid and computing the mean of each sub-region. Finally, the dimension and redundancy of the auditory gist can be reduced using, for example, principal component analysis (PCA) or the discrete cosine transform (DCT), producing the final features, referred to herein as the auditory gist.
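The grid-pooling and dimension-reduction steps above can be sketched as follows. The grid size, the choice of a DCT over PCA, and the number of retained coefficients are illustrative assumptions, not the patent's settings.

```python
import numpy as np

def gist_features(feature_map, m=4, n=5):
    """Divide a feature map into an m-by-n grid and take the
    mean of each sub-region, yielding an m*n gist vector."""
    F, T = feature_map.shape
    rows = np.array_split(np.arange(F), m)
    cols = np.array_split(np.arange(T), n)
    return np.array([feature_map[np.ix_(r, c)].mean()
                     for r in rows for c in cols])

def dct_reduce(v, keep=8):
    """Reduce dimension/redundancy with an (unnormalized) DCT-II,
    keeping only the first `keep` coefficients."""
    N = len(v)
    k = np.arange(N)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
    return (basis @ v)[:keep]

fmap = np.random.rand(128, 60)                       # toy feature map
gist = dct_reduce(gist_features(fmap, 4, 5), keep=8) # 8-dim gist
```

Gist vectors from all feature maps would be concatenated per frame before the reduction step.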
The attention model is described in patent application Ser. No. 13/078,866. FIG. 1A shows a block diagram of the attention model and a flow diagram for feature extraction. According to aspects of this disclosure, the method of FIG. 1A uses auditory attention cues for syllable/vowel/phone boundary detection in speech. The auditory attention model is biologically inspired and mimics the processing stages of the human auditory system. It is designed to determine when and where sound signals attract human attention.
First, an input window of sound 101 is received. This input window may be captured by a microphone that converts the acoustic waves of a specific input window into an electric signal. The input window of sound 101 can be any segment of speech and can contain, by way of example and not limitation, a single word, syllable, or sentence.
The input window of sound 101 is passed through a set of processing stages 103 that convert the window into an auditory spectrum. These stages 103 may be based on the early stages of an auditory system, such as the human auditory system. The processing stages 103 can, for example, consist of cochlear filtering, inner hair cell, and lateral inhibition stages, which mimic the auditory system's processing from the basilar membrane to the cochlear nucleus. The cochlear filtering can be implemented by a bank of 128 overlapping constant-Q asymmetric band-pass filters with center frequencies uniformly distributed on a logarithmic frequency scale. These filters can be implemented by electronic hardware configured for the filtering task, or by a general-purpose computer running software that performs the filter functions. For audio analysis, 20 ms frames with a 10 ms shift can be used, so that each frame of audio is represented by a 128-dimensional vector.
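The framing and the log-spaced filter-bank layout described above can be sketched as follows. The 16 kHz sample rate and the 80 Hz to 8 kHz frequency range are assumptions for illustration, and the log-spaced center frequencies merely stand in for the constant-Q cochlear filter bank itself.

```python
import numpy as np

def frame_signal(x, sr=16000, frame_ms=20, shift_ms=10):
    """Split a waveform into overlapping 20 ms frames with a 10 ms shift."""
    flen = int(sr * frame_ms / 1000)
    step = int(sr * shift_ms / 1000)
    n = 1 + max(0, (len(x) - flen) // step)
    return np.stack([x[i * step:i * step + flen] for i in range(n)])

# 128 center frequencies uniformly spaced on a logarithmic scale
# (the 80 Hz-8 kHz range is an assumption, not from the patent)
center_freqs = np.geomspace(80.0, 8000.0, num=128)

x = np.random.randn(16000)   # 1 s of toy audio at an assumed 16 kHz rate
frames = frame_signal(x)     # 99 frames of 320 samples each
```

Applying the 128-filter bank to each frame would then yield the 128-dimensional vector per frame mentioned above.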
As indicated at 107, the auditory spectrum is then analyzed by extracting multi-scale features 117, mimicking the information processing stages of the central auditory system. Auditory attention can be captured by, or voluntarily directed to, a variety of acoustical characteristics such as intensity (or energy), frequency, temporal characteristics, pitch, and timbre, along with oriented edges (called "orientation" here). These features can be implemented to mimic the receptive fields in the primary auditory cortex.
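The multi-scale analysis is typically paired with the center-surround comparison of finer and coarser scales mentioned earlier. A rough sketch follows, with block-averaging standing in for the model's scale pyramid (an assumed simplification):

```python
import numpy as np

def downsample(fmap, factor):
    """Coarser scale by block-averaging (a simple pyramid stand-in)."""
    F, T = fmap.shape
    F2, T2 = F - F % factor, T - T % factor
    return fmap[:F2, :T2].reshape(F2 // factor, factor,
                                  T2 // factor, factor).mean(axis=(1, 3))

def center_surround(fmap, c=1, s=3):
    """Center-surround difference: a finer 'center' scale minus a
    coarser 'surround' scale upsampled back to the center's size."""
    center = downsample(fmap, c)
    surround = np.kron(downsample(fmap, s), np.ones((s // c, s // c)))
    h = min(center.shape[0], surround.shape[0])
    w = min(center.shape[1], surround.shape[1])
    return np.abs(center[:h, :w] - surround[:h, :w])

fmap = np.random.rand(128, 60)   # toy intensity feature map
cs = center_surround(fmap)       # contrast map highlighting discontinuities
```

Large values in the resulting map mark locations where a fine scale disagrees with its coarser surround, i.e., local spectral or temporal discontinuities of the kind associated with phoneme boundaries.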