
Data

For training and testing our classifier, we decided to use the audio samples provided by Common Voice [1]. From this, we used three language datasets: English, Hindi, and Mandarin Chinese. The English and Chinese files were sampled at 48 kHz, while the majority of the Hindi files were sampled at 32 kHz. Each audio sample is under roughly 10 seconds long and generally contains a recording of one sentence or of one to two short sentences. Most recordings were made in controlled environments, though a small proportion has noticeable noise or audio distortion due to poor microphone quality. According to the creators of the dataset, the audio was crowd-sourced and generated by applying forced alignment to the spoken text; this is further explained in their paper, Multilingual Spoken Words Corpus [2]. We’ve included a few sample Hindi, Chinese, and English audio files below for reference:

[Audio players: sample Hindi, Chinese, and English clips]

Before passing our data into any classifiers, we took a few steps to prepare it. First, after reading in the data with MATLAB’s audioread function, we normalized each audio file by its peak amplitude using the audionormalization_YW function [3]. We also investigated bandpass filtering, but ultimately decided against including it: filtering each audio clip significantly increased computation time while only negligibly improving accuracy. We suspect this is because many of the recordings contain broadband noise or more complex distortions that a simple bandpass filter cannot remove. Finally, we chose to extract Mel-Frequency Cepstral Coefficients (MFCCs) to use as the primary features for audio analysis.
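
A minimal sketch of this preprocessing step is shown below. It assumes the Audio Toolbox's mfcc function and a hypothetical file name, and the inline peak normalization stands in for the audionormalization_YW function [3]:

% Preprocessing sketch (Audio Toolbox assumed; file name is hypothetical).
[x, fs] = audioread("sample_hindi.mp3");  % read one Common Voice clip
x = x(:, 1);                              % keep a single channel
x = x / max(abs(x));                      % peak normalization (stand-in for audionormalization_YW [3])
coeffs = mfcc(x, fs);                     % MFCCs per frame (default settings)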


What are MFCCs?

MFCCs were introduced by Davis and Mermelstein in the 1980s and are a widely used tool for speech recognition [4]. The objective of MFCCs is to imitate the cochlea in the ear, that is, the way we humans perceive sound, and to capture the distinguishing features of the phonemes produced by the vocal tract in spoken language. The human cochlea resolves frequency differences more finely at low frequencies than at high frequencies. This is reflected in the mel filter bank applied to the audio: note in the figure below that at higher and higher frequencies, the triangular filters cover wider and wider bands of frequencies.

[Figure: Mel filter bank (MelFilterBankPic.png)]

(Image above courtesy of MATLAB) [6]

How are MFCCs found?

In general, MFCCs are a short-term power spectrum representation of an audio sample. The audio sample is broken into a sequence of overlapping frames; an FFT of each frame is taken and passed through the filter bank shown above to extract the important features of each frequency range. The log of this output is then taken, because humans perceive loudness on a logarithmic scale. Lastly, a discrete cosine transform (DCT), a close relative of the FFT, is applied in an effort to decorrelate the output for the classifier. This process is shown in the figure below.

[Figure: MFCC block diagram (MFCCBlockDiagram_edited.jpg)]

(Image above courtesy of Medium) [7]
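
A rough MATLAB sketch of this block diagram is shown below, assuming the Audio Toolbox's designAuditoryFilterBank and the Signal Processing Toolbox's buffer, hamming, and dct functions; the frame size, overlap, and number of bands are illustrative rather than the exact settings we used:

% Illustrative MFCC pipeline (toolbox functions assumed; parameters are examples).
[x, fs] = audioread("sample_hindi.mp3");            % hypothetical clip, as before
x = x(:, 1) / max(abs(x(:, 1)));                    % single channel, peak-normalized
fftLen = 1024;
fb = designAuditoryFilterBank(fs, "FFTLength", fftLen, "NumBands", 32);  % mel filter bank
frames = buffer(x, fftLen, fftLen/2, "nodelay");    % overlapping frames (50% overlap)
S = abs(fft(frames .* hamming(fftLen)));            % FFT of each windowed frame
S = S(1:fftLen/2 + 1, :);                           % one-sided magnitude spectrum
melEnergies = fb * S;                               % apply triangular mel filters per frame
C = dct(log(melEnergies + eps));                    % log, then DCT to decorrelate
C = C(1:13, :);                                     % keep the first 13 coefficients per frame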


Looking at just one frame, we can break this process down a bit further. To understand MFCCs, we can start by understanding what a cepstrum is. First, we take a Fourier transform of a signal in the time domain (a frame). Next, we take the log magnitude of this frequency spectrum. The final step, a DCT (discrete cosine transform) of this log spectrum, results in what is called a cepstrum (which is just the “spec” part of “spectrum” reversed). Each frame thus yields a set of cepstral coefficients. The image below shows the four main steps of this process.

[Figure: The four steps of cepstrum computation (4StepPic.png)]

(Image above courtesy of Medium) [7]
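
As a small self-contained illustration of these four steps, a synthetic 200 Hz tone stands in for a speech frame below (the frame length and number of retained coefficients are arbitrary):

% One-frame cepstrum sketch: time-domain frame -> FFT -> log magnitude -> DCT.
fs = 16000;
frame = sin(2*pi*200*(0:511)'/fs) .* hamming(512);  % windowed synthetic frame
spec = abs(fft(frame));                             % 1) Fourier transform
logSpec = log(spec(1:257) + eps);                   % 2) log magnitude (one-sided)
ceps = dct(logSpec);                                % 3) DCT
firstCoeffs = ceps(1:13);                           % 4) cepstral coefficients for the frame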


Another component of MFCCs is the mel-frequency scale. This scale is computed from the frequency in hertz using the following formula.

[Figure: Mel frequency conversion formula (MelFEquationPic.png)]

(Image above courtesy of Medium) [7]


The mel filters, shown in the first figure, represent this conversion process. The mel scale relates perceived frequency to actual frequency. Humans are better at perceiving differences between frequencies in lower ranges; at higher ranges, two tones must be further apart in frequency before we perceive them as different. This behavior is what the mel scale attempts to capture.
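
Using the commonly cited form of the conversion, mel(f) = 2595 log10(1 + f/700), a quick sketch below shows how mel-spaced band edges widen with frequency (the number of bands and upper frequency are arbitrary):

% Mel conversion and mel-spaced band edges (common 2595*log10(1 + f/700) form).
hz2mel = @(f) 2595 * log10(1 + f / 700);
mel2hz = @(m) 700 * (10.^(m / 2595) - 1);
edgesHz = mel2hz(linspace(hz2mel(0), hz2mel(8000), 22));  % 20 triangular bands up to 8 kHz
bandwidthsHz = diff(edgesHz);   % bandwidths grow from roughly 90 Hz to nearly 1 kHz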



Pitch Features 

In the later stage of our project, we also chose to add pitch features to our previous data matrix of MFCC features. These are based on MATLAB’s pitch command, which estimates the fundamental frequency of an audio sample over time. After initially running classification on the MFCC features alone, we thought this addition would capture some additional traits of a language, such as tonality.


To extract pitch features, we used MATLAB’s pitch command and then removed non-voice segments, following one of MATLAB’s speaker identification examples [5]. Next, we removed all values below 25 Hz or above 375 Hz to discard outliers. To further smooth the data, we removed values more than 2 standard deviations from the mean and applied a moving average filter. Finally, we took the first derivative with respect to time. From this we extracted three features: the peak-to-peak pitch range, the mean of the first derivative, and the median of the first derivative.
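
A hedged sketch of this pipeline is shown below. The voiced/unvoiced masking from the speaker identification example [5] is elided and represented by a hypothetical logical vector voicedMask, and the smoothing window length is illustrative:

% Pitch feature sketch (Audio Toolbox assumed; voicedMask is a hypothetical
% per-frame logical vector following the speaker identification example [5]).
[x, fs] = audioread("sample_chinese.mp3");  % hypothetical clip
x = x(:, 1) / max(abs(x(:, 1)));
f0 = pitch(x, fs);                          % fundamental frequency estimate per frame
f0 = f0(voicedMask);                        % keep voiced frames only
f0 = f0(f0 >= 25 & f0 <= 375);              % drop outliers outside 25-375 Hz
f0 = f0(abs(f0 - mean(f0)) <= 2*std(f0));   % drop values beyond 2 standard deviations
f0 = movmean(f0, 5);                        % moving-average smoothing (window is illustrative)
df0 = diff(f0);                             % first derivative with respect to time
pitchFeats = [max(f0) - min(f0), mean(df0), median(df0)];  % peak-to-peak, mean, median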


A summary of the entire process is shown in the left diagram below. Note that the last step is classification using a machine learning model, which is discussed in the next section. The block diagram on the right outlines the steps of our pitch feature extraction process.

[Figure: Overall processing flowchart and pitch feature extraction block diagram (FlowchartMerge (3).png)]