
Discussion

We compared the accuracies of our classifiers and found that KNN provided the greatest accuracy for both binary and multi-class classification. We also observed that the classifiers performed better on binary classification than on multi-class classification; for example, the accuracy of KNN (our best classifier) dropped from 76.55% to 43.33% when we moved from two languages to three. We also noticed an interesting difference when comparing the accuracy of KNN with and without the relieff feature-selection function: when no relieff is used, the accuracy of the classifier increases by 0.33%. We believe this is caused by features that improved Mandarin classification accuracy but were detrimental to the model's overall performance. Though our overall accuracy decreases slightly when using relieff, trimming the feature matrix is still beneficial because it gives us more balanced accuracy across all languages instead of distinctly higher accuracy in one language and poorer accuracy in the others.
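A minimal Python sketch of the comparison described above (our actual pipeline used MATLAB's relieff; the `relieff_scores` function here is a simplified stand-in using only the single nearest hit and miss, and the data is synthetic, not our feature matrix):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def relieff_scores(X, y):
    """Simplified ReliefF-style feature weights (k=1 nearest hit/miss)."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dists = np.abs(X - X[i]).sum(axis=1)
        dists[i] = np.inf                      # exclude the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dists, np.inf))   # nearest same-class
        miss = np.argmin(np.where(~same, dists, np.inf)) # nearest other-class
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n

# Synthetic stand-in data: 2 informative features plus 8 noise features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)
X = rng.normal(size=(300, 10))
X[:, 0] += 2 * y
X[:, 1] -= 2 * y

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

acc_all = knn.fit(Xtr, ytr).score(Xte, yte)              # all features
top = np.argsort(relieff_scores(Xtr, ytr))[::-1][:2]     # keep top-ranked features
acc_sel = knn.fit(Xtr[:, top], ytr).score(Xte[:, top], yte)
print(acc_all, acc_sel)
```

On this synthetic data feature selection helps; on real features the tradeoff can go either way, which is exactly the small accuracy difference we observed.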


The major strength of our classifier is its reasonably accurate performance (about 70-80%) for binary classification. Its major weakness, on the other hand, is the accuracy drop when we expand the classifier to three languages. When we decided to make that expansion, we initially fitted the model with MFCC features alone. Seeing that the classifier performed poorly, we speculated that MFCCs did not capture enough information, so we introduced pitch as an additional feature set. Pitch helped improve the accuracy of our multi-class classification model, but we were still unable to reach our desired 70-80% accuracy. This was the major obstacle we encountered.
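As a rough illustration of the pitch feature idea, here is a minimal autocorrelation-based pitch estimate for one frame, sketched in Python with numpy (our project's actual pitch extraction is not shown here; a real tracker also needs voicing detection and smoothing, and the 220 Hz test tone is synthetic):

```python
import numpy as np

def estimate_pitch(x, fs, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency of a frame via autocorrelation."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lo = int(fs / fmax)                                # smallest lag to search
    hi = int(fs / fmin)                                # largest lag to search
    lag = lo + np.argmax(ac[lo:hi])                    # strongest periodicity
    return fs / lag

fs = 16000
t = np.arange(0, 0.05, 1 / fs)              # one 50 ms frame
frame = np.sin(2 * np.pi * 220 * t)         # synthetic "voiced" frame at 220 Hz
print(estimate_pitch(frame, fs))
```

Per-frame estimates like this, aggregated over a clip, are the kind of pitch statistics that can be appended to an MFCC-based feature matrix.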


As a result, we spent the majority of our time trying to improve the accuracy of the multi-class classifier. We did not have time to complete our bonus objectives: distinguishing different dialects, and testing whether the system can distinguish languages from our own recorded voices. However, we do have potential solutions to propose if this project were to be continued.


We would suggest looking at the Gaussian Mixture Model (GMM) and fitting that model to our data, or adding extra features to the feature matrix. Several of the research papers we read while brainstorming classifiers for this project successfully performed language detection using a GMM with MFCC features. Due to time constraints, however, we decided to focus on KNN, since most of our feature-extraction steps were well suited to a KNN classifier. Secondly, we recommend searching for additional features, since they might improve classification accuracy, much as pitch gave us a slight improvement.
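The GMM approach from those papers typically fits one mixture per language and classifies a clip by the model with the highest log-likelihood. A minimal Python sketch with sklearn (the language names, feature dimension, and Gaussian stand-in data are all hypothetical, not our dataset):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical per-frame feature vectors (e.g. MFCCs) for three "languages",
# drawn here from different synthetic distributions.
rng = np.random.default_rng(1)
train = {lang: rng.normal(loc=mu, size=(500, 4))
         for lang, mu in [("english", 0.0), ("spanish", 1.5), ("mandarin", 3.0)]}

# One GMM per language, fitted on that language's training frames.
models = {lang: GaussianMixture(n_components=2, random_state=0).fit(X)
          for lang, X in train.items()}

def classify(frames):
    """Pick the language whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda lang: models[lang].score(frames))

test_clip = rng.normal(loc=1.5, size=(200, 4))  # frames matching "spanish"
print(classify(test_clip))
```

Because each language gets its own model, this setup can also report a confidence margin between the top two languages, which KNN does not provide directly.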


In summary, what worked well was that the KNN classifier achieved the greatest accuracy of the classifiers we tried and was reasonably accurate for binary classification. The pitch features we implemented for multi-class classification also improved our classifier's accuracy. Unfortunately, the multi-class accuracy we achieved with our features was still well below our goal. Though the system did not work as well as we expected, we believe several potential solutions exist: using a GMM for classification, or looking into additional features for the feature matrix, might be beneficial toward improving the accuracy of the system.


A link to our project repo can be found here: Github Repo




Relevant Tools

In this project, we used the following tools from 351: frequency-domain analysis, change of basis, filtering, and linear systems in the form of moving averages and differentiators. Outside of the topics studied in class, we additionally used Mel-frequency cepstral coefficients (MFCCs).


Frequency-domain analysis: Our initial attempts at classification involved basic Fourier-domain processing: we would take the DFT of each audio signal and use the coefficients to train the model. However, we quickly realized that MFCCs made much better features. Our final model still processes data in the frequency domain, applying low-pass filtering to smooth pitch data.
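A small Python sketch of frequency-domain smoothing of a pitch contour (the contour and noise here are synthetic, and the choice to keep the lowest 8 DFT bins is an illustrative assumption, not our tuned cutoff):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 256
t = np.arange(n)
contour = 150 + 20 * np.sin(2 * np.pi * t / n)   # slow underlying pitch trend
noisy = contour + rng.normal(scale=10, size=n)   # frame-level jitter

# Low-pass in the frequency domain: zero high-frequency bins, then invert.
spectrum = np.fft.rfft(noisy)
spectrum[8:] = 0                                 # keep only the lowest 8 bins
smoothed = np.fft.irfft(spectrum, n)

print(np.abs(noisy - contour).mean(), np.abs(smoothed - contour).mean())
```

The smoothed contour sits much closer to the true trend than the raw noisy one, which is the effect we relied on before computing pitch statistics.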


Change of basis: Processing data in the Fourier domain requires a change of basis from the time domain to the frequency domain. We can then perform operations such as filtering to remove unwanted frequencies.
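The DFT itself is exactly this change of basis: multiplying a signal by the matrix of complex-exponential basis vectors gives the same result as the FFT. A quick numpy check:

```python
import numpy as np

n = 8
k = np.arange(n)
# DFT matrix: each row is one complex-exponential basis vector.
F = np.exp(-2j * np.pi * np.outer(k, k) / n)

x = np.random.default_rng(3).normal(size=n)
# Multiplying by F performs the change of basis; it matches the FFT.
print(np.allclose(F @ x, np.fft.fft(x)))  # True
```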


Filtering: In the early stages of our project, we investigated the feasibility of bandpass filtering all of our audio clips to remove noise. We also implemented low-pass filtering in our pitch-analysis process to lessen the effects of noise. Our final version, however, no longer applies a bandpass filter to input clips because of the significant increase in computation time.
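A sketch of the kind of bandpass filtering we experimented with, using scipy (the 300-3400 Hz band edges and filter order are illustrative assumptions covering a typical speech band, not our exact design):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000
# 4th-order Butterworth bandpass roughly covering the speech band.
sos = butter(4, [300, 3400], btype="bandpass", fs=fs, output="sos")

t = np.arange(0, 0.5, 1 / fs)
tone = np.sin(2 * np.pi * 1000 * t)   # in-band component (passes through)
hum = np.sin(2 * np.pi * 60 * t)      # out-of-band noise (attenuated)

tone_out = sosfiltfilt(sos, tone)     # zero-phase filtering
hum_out = sosfiltfilt(sos, hum)
print(np.std(tone_out), np.std(hum_out))
```

Running an IIR filter like this over every clip is what drove up our computation time and led us to drop the bandpass stage from the final pipeline.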


Linear systems: When computing pitch data, we faced the challenge of smoothing the data to minimize the impact of noise and other bad data on our pitch contour. To tackle this problem, we included moving-average and differentiation steps, both of which are linear systems.
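Both steps can be sketched in a few lines of numpy (the rising contour and noise level here are synthetic stand-ins, and the 5-tap window is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
contour = np.linspace(100, 200, 100)            # rising pitch contour (Hz)
noisy = contour + rng.normal(scale=5, size=100) # contour with noisy frames

# Moving average: an FIR linear system with equal taps.
win = np.ones(5) / 5
smoothed = np.convolve(noisy, win, mode="valid")

# First difference: a linear system approximating the derivative,
# useful for spotting implausible jumps in the contour.
slope = np.diff(smoothed)
print(slope.mean())
```

After smoothing, the differentiated contour hovers near the true slope (about 1 Hz per frame here) instead of being dominated by frame-to-frame noise.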
