Abstract
Introduction: This paper introduces a novel and reliable approach to assist people with speech impairment in communicating effectively in real time. A deep learning technique, the convolutional neural network (CNN), is used as the classifier. With this algorithm, words are recognized from visual speech input alone, disregarding the audible or acoustic properties of speech.
Methods: The network extracts features from images of mouth movements. Non-audible mouth movements captured from a video source are taken as input and segregated into subsets to obtain the desired output. The complete data set is then arranged to recognize the word as an affricate.
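As a rough illustration of the input stage, the sketch below crops a mouth region of interest (ROI) from each frame of a silent video. The face detector, the lower-third crop heuristic, and the 32x32 frame size are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: extract mouth-ROI frames from a silent video.
# Detector choice, crop heuristic, and frame size are illustrative assumptions.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_roi_frames(video_path, size=(32, 32)):
    """Return a (num_frames, H, W) array of grayscale mouth crops."""
    cap = cv2.VideoCapture(video_path)
    rois = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.3, 5)
        if len(faces) == 0:
            continue
        x, y, w, h = faces[0]
        # Heuristic: the mouth lies roughly in the lower third of the face box.
        mouth = gray[y + 2 * h // 3 : y + h, x : x + w]
        rois.append(cv2.resize(mouth, size))
    cap.release()
    return np.stack(rois) if rois else np.empty((0, *size))
```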
Results: The convolutional neural network is one of the most effective algorithms for extracting features, performing classification, and providing the desired output from the input images in the speech recognition system.
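A minimal sketch of such a classifier is given below, assuming single 32x32 grayscale mouth-ROI images and a small set of syllable classes; the layer sizes and the class count are our own assumptions, not the authors' configuration.

```python
# Illustrative CNN for classifying syllables from 32x32 grayscale mouth ROIs.
# Layer sizes and the number of classes are assumptions, not the paper's setup.
import tensorflow as tf

NUM_SYLLABLES = 10  # hypothetical number of syllable classes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),   # feature extraction
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_SYLLABLES, activation="softmax"),  # classification
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```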
Conclusion: The main objective of the proposed method is to recognize syllables in real time from visual mouth-movement input. When the proposed system was tested, the accuracy of the data and the quantity of the training sets proved satisfactory. A small data set was taken as the first step of learning; in the future, a larger data set can be considered for analysis.
Discussion: The network proposed in this paper was tested for its precision on the basis of the type of data. The network is maintained to identify syllables, but it fails when the syllables belong to the same set. Higher-end graphics processing units are required to reduce the time consumption and increase the efficiency of the network.
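One simple way to expose this confusion between similar syllables (our own illustrative check, not the paper's evaluation protocol) is to inspect a confusion matrix over held-out test images; `model`, `x_test`, and `y_test` below are assumed to come from the earlier classifier sketch.

```python
# Illustrative evaluation: off-diagonal entries of the confusion matrix reveal
# syllables that the network mixes up (e.g. syllables from the same set).
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = np.argmax(model.predict(x_test), axis=1)
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```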
Keywords: Convolution Neural Network (CNN), Random Test Image (RTI), Region Of Interest (ROI), Separate Test Image (STI), Syllables, Visual Speech Recognition (VSR).
Graphical Abstract