We recorded a Marathi database of the numerals zero to nine, with the aim of implementing a numeral-based password system and similar everyday applications. Twenty samples of each word were recorded from different people, and each sample was normalized by dividing it by its maximum value. The normalized samples were then processed using Dynamic Time Warping (DTW). Of the 20 samples recorded for each word, 16 were used to train the DTW recognizer and the remaining 4 were reserved for testing.
In this project, speech recognition software was developed using the MFCC and DTW algorithms. A reference file was created for each of the pre-recorded speech signals. When a microphone input signal was applied, its MFCC coefficients were compared with the MFCC coefficients of the pre-recorded speech using the DTW algorithm. The DTW output scores identify the pre-recorded signal nearest to the input. Finally, the result was displayed on the MATLAB output screen. The software displays the correct numeral once the applied microphone signal has been compared with the pre-recorded and online signals.
The results for some of the features extracted from the recorded database of Marathi numerals zero to nine are shown in the figures below.
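
The normalization and the 16/4 split described above can be sketched as follows. This is a minimal Python illustration, not the MATLAB code used in the project. It assumes that "maximum value" means the largest absolute amplitude and that the split is a random shuffle; the names `peak_normalize` and `split_samples` are hypothetical, and the recordings are toy numbers rather than speech data.

```python
import random

def peak_normalize(signal):
    """Scale a signal so that its largest absolute amplitude is 1.0."""
    peak = max(abs(x) for x in signal)
    if peak == 0:
        return list(signal)  # silent recording: nothing to scale
    return [x / peak for x in signal]

def split_samples(samples, n_train=16, seed=0):
    """Shuffle the recordings of one word and split them into train/test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    return shuffled[:n_train], shuffled[n_train:]

# Toy stand-ins for 20 recordings of one numeral (not real speech data).
recordings = [[0.1 * k, -0.2 * k, 0.05 * k] for k in range(1, 21)]
normalized = [peak_normalize(r) for r in recordings]
train, test = split_samples(normalized)
print(len(train), len(test))  # 16 4
```

Dividing by the peak removes loudness differences between speakers and recording sessions, so that later distance computations compare the shape of the signal rather than its level.
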

Fig.6.1: Mel Frequency Cepstrum Coefficients of shunya. Fig.6.2: Mel Frequency Cepstrum Coefficients of pach.

6.1 Graphical User Interface (GUI) of the system

We have created a GUI for the numeral recognition system. The DTW 0-9 Digit Recognizer has command buttons such as record, open, play and recognize, and it displays the opened wave file.
In this DTW digit recognizer, the open button reads a pre-recorded numeral and the record button captures an online numeral spoken by the speaker. The pre-recorded or online numeral can be played back and then recognized using DTW for feature matching. The recognizer matches the input against the templates by computing the warping distance to each numeral. The template with the minimum warping distance is chosen as the recognized numeral, and it is shown on the GUI display.
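
The template matching described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's MATLAB implementation: the frame distance is taken to be Euclidean, the DTW uses the basic three-neighbour step pattern without path constraints, and one template is stored per numeral. The function names and the one-dimensional toy "features" are hypothetical; real input would be sequences of MFCC vectors.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (e.g. MFCC frames)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Minimum cumulative alignment cost between two sequences of frames."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(test_frames, templates):
    """Return the label whose template has the smallest warping distance."""
    return min(templates, key=lambda label: dtw_distance(test_frames, templates[label]))

# Toy one-dimensional "features"; the spoken input is a slower shunya.
templates = {
    "shunya": [[0.0], [1.0], [2.0], [1.0]],
    "ek": [[3.0], [3.0], [0.0], [0.0]],
}
spoken = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]]
print(recognize(spoken, templates))  # shunya
```

Because the warping path may repeat frames, the slower utterance aligns with the shunya template at zero cost even though the two sequences have different lengths.
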

Fig.6.3: GUI of DTW Digit Recognizer. Fig.6.4: GUI of opened wave file.

Fig.6.5: GUI of pattern matching of shunya. Fig.6.6: GUI of recognized numeral shunya.

Fig.6.7: GUI of pattern matching of ek. Fig.6.8: GUI of recognized numeral ek.

Fig.6.9: GUI of pattern matching of saha. Fig.6.10: GUI of recognized numeral saha.

Fig.6.11: GUI of pattern matching of nau. Fig.6.12: GUI of recognized numeral nau.

6.2 Testing And Results

6.2.1 Testing with pre-recorded samples

Out of the 20 samples recorded for each word, 16 were used for training. We tested the program's accuracy with the remaining 4 samples. A total of 20 samples were tested (4 samples each for the 5 words) and the program yielded the right result for all 20 samples. Thus, we obtained 100% accuracy with pre-recorded samples.

6.2.2 Real-time testing

For real-time testing, we recorded a sample using a microphone and ran the program directly on that sample. A total of 30 samples were tested, of which 24 gave the right result. This gives an accuracy of 80% with real-time samples.

6.2.3 Results

• Case 1: Speaker independent (20 templates per digit: 10 male, 10 female).
The implemented system was tested on 100 samples of each word, spoken by 50 different speakers with 2 samples of each digit per speaker.
The results of this test are given in Table 6.1.

Table 6.1: Accuracy of the Speaker Independent Test Results.

DIGIT 0 1 2 3 4 5 6 7 8 9
% ACCURACY 87 88 82 78 79 84 85 81 78 87

• Case 2: Speaker dependent (one template per digit).
The implemented system was tested on 10 samples of each word spoken by a single speaker. The results are given in Table 6.2.

Table 6.2: Accuracy of the Speaker Dependent Test Results.

DIGIT 0 1 2 3 4 5 6 7 8 9
% ACCURACY 90 91 84 90 87 88 92 84 86 92

The accuracy for pre-recorded samples is higher than that for real-time samples. Likewise, the accuracy for speaker-dependent samples is higher than that for speaker-independent samples.

Table 6.3: Confusion Matrix of the MFCC & DTW Recognition.

ek don teen char pach saha sat aath nau shunya Avg. %
ek 1 1 1 4 1 1 1 1 1 0 80
don 2 2 2 2 3 2 2 2 2 2 90
teen 3 3 3 3 9 3 3 2 2 2 80
char 4 4 5 4 4 4 4 6 4 4 80
pach 5 5 5 5 5 5 5 5 5 3 90
Saha 6 6 6 6 1 6 6 4 6 6 80
Sat 7 7 8 7 7 7 7 7 7 7 90
Aath 2 8 8 8 8 7 8 8 8 8 80
nau 9 9 4 9 9 5 9 9 9 9 80
shunya 0 0 0 0 5 0 0 2 0 0 80

Table 6.4: Confusion Matrix of the MFCC & HMM Recognition.

ek don teen char pach saha sat aath nau shunya Avg. %
ek 1 1 1 1 1 1 1 3 1 1 90
don 2 2 2 2 2 2 2 2 2 5 90
teen 3 3 3 3 3 3 1 3 3 3 90
char 4 3 4 4 4 4 4 4 8 4 80
pach 5 5 5 5 5 5 5 5 5 5 100
Saha 6 6 6 8 6 6 6 6 6 6 90
Sat 7 7 7 7 7 7 7 5 7 7 90
Aath 8 8 8 8 8 8 8 8 5 8 90
nau 9 9 9 9 9 7 9 9 9 9 90
shunya 0 0 0 7 0 0 0 5 0 0 80

Table 6.5: Comparison of Digit Recognition Accuracy Test Results.

Numeral DTW Accuracy HMM Accuracy
ek 80 90
don 90 90
teen 80 90
char 80 80
pach 90 100
Saha 80 90
Sat 90 90
Aath 80 90
nau 80 90
shunya 80 80
Average % 83% 89%
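
The averages in the last row of Table 6.5 can be checked with a short Python calculation. The per-numeral values are copied from the table; the variable and function names are illustrative only.

```python
# Per-numeral accuracies (%) copied from Table 6.5.
dtw_acc = {"ek": 80, "don": 90, "teen": 80, "char": 80, "pach": 90,
           "saha": 80, "sat": 90, "aath": 80, "nau": 80, "shunya": 80}
hmm_acc = {"ek": 90, "don": 90, "teen": 90, "char": 80, "pach": 100,
           "saha": 90, "sat": 90, "aath": 90, "nau": 90, "shunya": 80}

def mean_accuracy(per_numeral):
    """Unweighted mean of the per-numeral accuracies."""
    return sum(per_numeral.values()) / len(per_numeral)

print(mean_accuracy(dtw_acc))  # 83.0
print(mean_accuracy(hmm_acc))  # 89.0
```
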

Experimentally, it is observed that recognition accuracy is better for HMM compared with DTW, but the training procedure in DTW is very simple and fast, as compared with the HMM.

Fig.6.13: Recognition accuracy of the DTW & HMM.

Recognition of numerals takes longer with HMM than with DTW, because HMM has to go through many states, iterations and additional mathematical modeling. DTW is therefore preferred over HMM for real-time applications.

7. CONCLUSIONS AND FUTURE SCOPE

7.1 Conclusions

• Despite the advances accomplished over the last decades, automatic speech recognition (ASR) is still a challenging and difficult task.
• Mel Frequency Cepstral Coefficients (MFCCs), a non-parametric method for modeling the human auditory perception system, are used as the feature extraction technique. The nonlinear sequence alignment known as Dynamic Time Warping (DTW) is used as the feature matching technique. Since the voice signal tends to have a varying temporal rate, this alignment is important for achieving better performance.
• This paper proposed that higher recognition rates can be achieved using MFCC features with DTW which is useful for different time varying numeral speech utterances.
• MFCC analysis provides better recognition rate than LPC as it operates on a logarithmic scale which resembles human auditory system whereas LPC has uniform resolution over the frequency plane. This is followed by pattern recognition. Since the voice signal tends to have different temporal rate, DTW is one of the methods that provide non-linear alignment between two voice signals.
• Another method, HMM, which models the words statistically, is also presented. Experimentally, it is observed that recognition accuracy is better for HMM than for DTW, but the training procedure in DTW is very simple and fast compared with HMM.
• Recognition of numerals takes longer with HMM than with DTW, because HMM has to go through many states, iterations and additional mathematical modeling. DTW is therefore preferred over HMM for real-time applications.
• DTW is a cost minimization matching technique, in which a test signal is stretched or compressed according to a reference template.
• The accuracy for pre-recorded samples is higher than that for real-time samples. Likewise, the accuracy for speaker-dependent samples is higher than that for speaker-independent samples.
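
The logarithmic frequency scale that the conclusions attribute to MFCC analysis can be illustrated with the mel scale. The sketch below uses one commonly quoted conversion formula, mel = 2595 log10(1 + f/700). The source does not state which variant of the mel scale was used in the project, so this formula and the function names are assumptions.

```python
import math

def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz shrink on the mel axis as frequency grows,
# unlike the uniform frequency resolution attributed to LPC.
for f in (100, 1000, 4000, 8000):
    print(f, round(hz_to_mel(f), 1))
```

Under this formula, a 100 Hz step near 1 kHz covers more mels than the same step near 8 kHz, which is the sense in which the scale allocates finer resolution to low frequencies.
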

7.2 Future Scope

• One of the key areas for future work is extending the system to a large vocabulary and improving the robustness of speech recognition performance.
• Another key area of research is focused on an opportunity rather than a problem. This research attempts to take advantage of the fact that in many applications there is a large quantity of speech data available, up to millions of hours. It is too expensive to have humans transcribe such large quantities of speech, so the research focus is on developing new methods of machine learning that can effectively utilize large quantities of unlabeled data.
• A further direction is to gain a better understanding of human capabilities and to use this understanding to improve machine recognition performance.
• Future work could be directed towards online speech summarization. The majority of speech summarization research has focused on extracting the most informative dialogue acts from recorded, archived data.
• The future work could be towards minimizing the time required for recognition of numerals using HMM.