Recent Advances in Computer Science and Communications

ISSN (Print): 2666-2558
ISSN (Online): 2666-2566

Research Article

Convolution Neural Network Based Visual Speech Recognition System for Syllable Identification

Author(s): Hunny Pahuja*, Priya Ranjan, Amit Ujlayan and Ayush Goyal

Volume 15, Issue 1, 2022

Published on: 17 September, 2020

Pages: 139-150 (12 pages)

DOI: 10.2174/2666255813999200917142628

Abstract

Introduction: This paper introduces a novel and reliable approach to assist people with speech impairment in communicating effectively in real time. A deep learning technique, the convolution neural network (CNN), serves as the classifier. With this algorithm, words are recognized from visual speech input alone, disregarding the audible or acoustic properties.

Methods: The network extracts features from images of mouth movements. Non-audible mouth movements are captured from a source, taken as input, and segregated into subsets to obtain the desired output. The complete data are then arranged so that the word is recognized as an affricate.
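
The paper does not publish code; as a rough illustration of the kind of classifier the Methods describe, the sketch below builds a small convolution neural network in PyTorch that maps a mouth-region (ROI) crop to syllable class scores. Everything here is assumed rather than taken from the paper: the 64x64 grayscale input size, the layer sizes, and the names SyllableCNN and NUM_SYLLABLES are all illustrative.

```python
# Minimal illustrative sketch -- not the authors' implementation.
# Assumes 64x64 grayscale mouth-ROI crops and a fixed syllable vocabulary.
import torch
import torch.nn as nn

NUM_SYLLABLES = 10  # hypothetical number of syllable classes

class SyllableCNN(nn.Module):
    def __init__(self, num_classes: int = NUM_SYLLABLES):
        super().__init__()
        # Convolution layers extract visual features from the mouth ROI.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 32x32 -> 16x16
        )
        # Fully connected layers turn the features into syllable scores.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Usage: a batch of 8 ROI crops -> per-syllable class scores.
model = SyllableCNN()
scores = model(torch.randn(8, 1, 64, 64))  # shape: (8, NUM_SYLLABLES)
```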

Results: The convolution neural network is among the most effective algorithms for this speech recognition system: it extracts features from the input images, performs the classification, and produces the desired output.
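
To make this concrete, a standard supervised training step for such a classifier (cross-entropy loss over syllable labels) might look as follows. This reuses the hypothetical SyllableCNN from the sketch above; the random tensors stand in for real mouth-ROI images and labels, not the paper's data.

```python
# Illustrative training step for the hypothetical SyllableCNN above.
import torch
import torch.nn as nn

model = SyllableCNN()  # defined in the previous sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 1, 64, 64)              # placeholder ROI batch
labels = torch.randint(0, NUM_SYLLABLES, (8,))  # placeholder syllable labels

optimizer.zero_grad()
loss = criterion(model(images), labels)  # classification loss over syllables
loss.backward()                          # backpropagate through the network
optimizer.step()                         # update the weights

pred = model(images).argmax(dim=1)       # predicted syllable per image
```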

Conclusion: The main objective of the proposed method is to recognize syllables in real time from visual mouth-movement input. When the proposed system was tested, its accuracy and the quantity of the training sets proved satisfactory. A small dataset was used as the first step of learning; in the future, a larger dataset can be considered for analysis.

Discussion: The network proposed in this paper was tested for its precision on different types of data. It identifies syllables reliably, but fails when the syllables belong to the same set. Higher-end graphics processing units (GPUs) are needed to reduce processing time and increase the network's efficiency.
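
On the GPU point, moving such a network and its inputs onto a CUDA device is a small change in PyTorch; the sketch below (again reusing the hypothetical SyllableCNN) shows the device placement that faster hardware would accelerate.

```python
# Illustrative only: run the hypothetical SyllableCNN on a GPU when available.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SyllableCNN().to(device)              # move weights to the GPU
batch = torch.randn(8, 1, 64, 64).to(device)  # move inputs to the same device
with torch.no_grad():
    scores = model(batch)                     # forward pass runs on the GPU
```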

Keywords: Convolution Neural Network (CNN), Random Test Image (RTI), Region of Interest (ROI), Separate Test Image (STI), Syllables, Visual Speech Recognition (VSR).
