Plotted Data
To analyze speech, we use time-frequency analysis with spectrograms to observe how the Fourier coefficients change over time. Figure 1 displays the spectrogram of the English phrase “But it’s just something I forgot to pack”. Features can be extracted from the spectrogram, for instance the high-frequency peaks produced by the consecutive “S” fricatives. The plot was obtained using MATLAB's "spectrogram" command and plotting the absolute values of the FFT coefficients.
Figure 1: English example clip spectrogram
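As a rough sketch of how such a plot can be produced (the file name, window length, and overlap below are illustrative choices, not the exact settings used for Figure 1):

    % Minimal sketch: magnitude spectrogram of a speech clip (illustrative parameters).
    [x, fs] = audioread('english_clip.wav');    % hypothetical file name
    x = mean(x, 2);                             % collapse to mono if necessary

    win  = hamming(round(0.03*fs), 'periodic'); % ~30 ms analysis window
    nov  = round(0.75*numel(win));              % 75% overlap between windows
    nfft = 2^nextpow2(numel(win));

    % With output arguments, spectrogram returns the complex STFT coefficients;
    % plotting their absolute values gives a Figure 1-style magnitude spectrogram.
    [s, f, t] = spectrogram(x, win, nov, nfft, fs);
    imagesc(t, f, abs(s)); axis xy;
    xlabel('Time (s)'); ylabel('Frequency (Hz)');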
Mel-frequency analysis is commonly used in speech and natural language processing applications. Converting to the mel scale accounts for the acoustic properties of the human ear and allows our model to mimic human perception of audio. Figure 2 shows the mel-spectrogram of the same English clip shown in Figure 1. The “S” fricatives are still visible but have “shorter” peaks, since their energy maps to lower values on the mel scale. The figure was obtained using MATLAB's "mfcc" command.
Figure 2: English example clip mel-spectrogram
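A minimal sketch of mel-domain analysis with the Audio Toolbox is shown below; the clip name and number of mel bands are placeholder choices.

    % Minimal sketch: mel-domain analysis of a clip (illustrative only).
    [x, fs] = audioread('english_clip.wav');   % hypothetical file name
    x = mean(x, 2);                            % collapse to mono if necessary

    % Mel-spectrogram for visualization (a Figure 2-style plot)
    melSpectrogram(x, fs, 'NumBands', 40);     % 40 mel bands is an arbitrary choice

    % MFCC features for classification; each row of coeffs is one analysis frame
    coeffs = mfcc(x, fs);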
In order to classify languages, we use the features extracted from the training data of each language to train our classifier. Figure 3 is the mel-spectrogram of a clip in Mandarin. One distinct feature is the concave dip visible just before the one-second mark: the speech involves a falling tone followed by a rising tone, a tonal pattern that shows up only in the Mandarin mel-spectrogram.
Figure 3: Mandarin example mel-spectrogram
Current Progress & Results
We utilized MATLAB's "fitcauto" command, using 200 audio clips of English and Mandarin each as training data. We then used the model to categorize 100 clips of English and Mandarin each. When we interpreted our results from the Confusion Matrix, Figure 4, we found out that the model was pretty good at classifying the Chinese data, but did not perform as well when it came to the English data. As we looked through our training data to see what might have caused this, we discovered that much of the English training data had poorer quality than that of the Chinese training data. We suspect this is what is causing the discrepancy in the performance of our model between the two languages and have come up with three possible workarounds for this issue.
The first possible solution is to comb through the English training dataset and manually select only the good-quality samples to feed into our model. Another solution would be to find a different dataset with better-quality samples. After further analysis of the spectrograms of our data, we came up with a third possible solution: pre-filter the training data to reduce the white noise in it, improving its quality before it is used to train the model (a sketch of one such pre-filter appears after Figure 4). While we do not know for certain that the poor English classification performance is due to the quality of the training data, we have decided to tackle this issue first to see whether it solves the problem. If the model gives similar results even after the training data quality is fixed, we will take a closer look at the audio features we chose for our model.
Figure 4: Confusion matrix of our model run on 200 audio clips
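As a rough illustration of the third workaround, one simple pre-filter is a band-pass that keeps the main speech band and suppresses out-of-band noise; the 100 Hz to 4 kHz band and the filter order below are assumptions, not tuned values.

    % Minimal sketch: band-pass pre-filtering of a clip before feature extraction.
    [x, fs] = audioread('english_clip.wav');   % hypothetical file name

    bp = designfilt('bandpassiir', ...
        'FilterOrder', 8, ...
        'HalfPowerFrequency1', 100, ...        % assumed lower edge of the speech band
        'HalfPowerFrequency2', 4000, ...       % assumed upper edge of the speech band
        'SampleRate', fs);

    xClean = filtfilt(bp, x);                  % zero-phase filtering of the clip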
Future Plan
Over the next few weeks, we plan to accomplish the following tasks:
Fix the quality issue of the English dataset used for training our model.
If the above task does not improve the classifier, we will investigate our approach to extracting the audio features used by our model.
Expand our model's ability to classify more than two languages.
New Knowledge Learned
A new DSP concept that we have used is Mel-frequency Cepstral Coefficients (MFCCs), the coefficients that make up a Mel-frequency cepstrum (MFC). The MFC is a representation of the short-term power spectrum of a sound, obtained by taking a Discrete Cosine Transform (DCT) of the log power spectrum mapped onto the mel scale, a non-linear frequency scale that mimics the frequency resolution of the human ear.
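A minimal sketch of this pipeline (windowed power spectrum, mel filter bank, log, DCT) is shown below; the window length, hop size, and number of bands are illustrative choices, not the settings used in our model.

    % Minimal sketch of the MFCC pipeline: STFT power spectrum -> mel filter bank -> log -> DCT.
    [x, fs] = audioread('english_clip.wav');       % hypothetical file name
    x = mean(x, 2);                                % collapse to mono if necessary

    win  = hamming(round(0.025*fs), 'periodic');   % 25 ms analysis window
    hop  = round(0.010*fs);                        % 10 ms hop between frames
    nfft = 2^nextpow2(numel(win));

    % Short-time power spectrum (one-sided, since x is real)
    [S, ~, ~] = spectrogram(x, win, numel(win) - hop, nfft, fs);
    P = abs(S).^2;

    % Mel filter bank maps the linear-frequency bins onto 40 mel bands
    fb = designAuditoryFilterBank(fs, 'FFTLength', nfft, 'NumBands', 40);

    % Log mel energies, then a DCT along each frame gives the cepstral coefficients
    logMel = log(fb * P + eps);
    coeffs = dct(logMel);
    mfccs  = coeffs(1:13, :);                      % keep the first 13 coefficients per frame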
We decided to use MFCCs in our project because they allow our model to imitate the behavior of the human ear. For voiced sounds, the glottal source has roughly a -12 dB/octave spectral slope. However, radiation of the acoustic energy from the lips adds roughly a +6 dB/octave boost to the spectrum. As a result, a speech signal recorded at a microphone has roughly a -6 dB/octave slope relative to the true spectrum of the vocal tract. The MFCC front end takes this into account by emphasizing higher frequencies, balancing out the steep high-frequency roll-off in the spectrum of voiced sounds.
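A common way to apply this kind of high-frequency emphasis is a first-order pre-emphasis filter; the sketch below uses the conventional coefficient 0.97, which is an assumption rather than a value taken from our pipeline.

    % Minimal sketch: first-order pre-emphasis, y[n] = x[n] - 0.97*x[n-1].
    % This boosts high frequencies by roughly +6 dB/octave, compensating for the
    % net -6 dB/octave tilt of recorded speech described above.
    [x, fs] = audioread('english_clip.wav');   % hypothetical file name
    y = filter([1, -0.97], 1, x);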
MFCCs are considered one of the most important feature extraction techniques for speaker identification. They form a compact set of acoustic descriptors of speech that are correlated with the underlying audio and can be computed directly from the signal waveform.