EECS 351 Language Classification
Data
To train and test our classifier, we used audio samples from Common Voice [1], drawing on three language datasets: English, Hindi, and Mandarin Chinese. The English and Chinese files were sampled at 48 kHz, while the majority of the Hindi files were sampled at 32 kHz. Each audio sample is under roughly 10 seconds long and contains a recording of either a single sentence or one to two short sentences. Most recordings were made in controlled environments, though a small proportion contain noticeable noise or audio distortion due to poor microphone quality. According to the creators of the dataset, the audio was crowd-sourced and generated by applying forced alignment to the spoken text; this is further explained in their paper, Multilingual Spoken Words Corpus [2]. We've included a few sample Hindi, Chinese, and English audio files below for reference: