
International Journal of Sensors, Wireless Communications and Control

ISSN (Print): 2210-3279
ISSN (Online): 2210-3287

Research Article

An Investigation of Multilingual TDNN-BLSTM Acoustic Modeling for Hindi Speech Recognition

Author(s): Ankit Kumar* and Rajesh Kumar Aggarwal

Volume 12, Issue 1, 2022

Published on: 18 January, 2021

Page: [19 - 31] Pages: 13

DOI: 10.2174/2210327911666210118143758

Abstract

Background: Thousands of languages and dialects are spoken in India, and most of them are low-resource. A well-performing Automatic Speech Recognition (ASR) system is unavailable for most Indian languages because of this lack of resources. Hindi is one such language: large-vocabulary Hindi speech datasets are not freely available, and only a few hours of transcribed Hindi speech exist. Creating a well-transcribed speech dataset takes considerable time and money, so developing a real-time ASR system from a few hours of training data is a challenging task. Techniques such as data augmentation, semi-supervised training, multilingual architectures, and transfer learning have been reported in the past to tackle this scarcity of speech data. In this paper, we examine the effect of multilingual acoustic modeling on ASR for the Hindi language.
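Of the techniques listed, data augmentation is the most mechanical to illustrate. The sketch below shows speed perturbation in the style popularized by Kaldi recipes, where each waveform is resampled at factors 0.9, 1.0, and 1.1 to triple the training data; the resampling here is plain linear interpolation, and the function name and toy signal are illustrative, not from the paper.

```python
def speed_perturb(samples, factor):
    """Resample a waveform by `factor` via linear interpolation.
    factor > 1 speeds the audio up (fewer samples); < 1 slows it down."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                      # fractional position in the input
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out

signal = [float(i) for i in range(100)]       # toy 100-sample ramp
augmented = [speed_perturb(signal, f) for f in (0.9, 1.0, 1.1)]
print([len(a) for a in augmented])            # [111, 100, 90]
```

Each perturbed copy is treated as an independent training utterance, which is why this simple trick roughly triples the effective amount of transcribed speech.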

Objective: This article’s objective is to develop a Hindi ASR system that achieves high accuracy with a reasonable computational load using only a few hours of training data.

Methods: To achieve this goal, we used multilingual training with a Time Delay Neural Network-Bidirectional Long Short Term Memory (TDNN-BLSTM) acoustic model. Multilingual acoustic modeling has significantly improved ASR performance for low- and limited-resource languages. The common practice is to train the acoustic model on data merged from similar languages. In this work, we use three Indian languages, namely Hindi, Marathi, and Bengali: Hindi with 2.5 hours of training data, Marathi with 5.5 hours, and Bengali with 28.5 hours of transcribed data were used to train the proposed model.

Results: The Kaldi toolkit was used to perform all the experiments. The paper investigates three main points. First, we present monolingual ASR systems using various Neural Network (NN) based acoustic models. Second, we show that Recurrent Neural Network (RNN) language modeling further improves ASR performance. Finally, we show that a multilingual ASR system significantly reduces the Word Error Rate (WER): an absolute 2% WER reduction for Hindi and 3% for Marathi. In all three languages, the proposed TDNN-BLSTM-A multilingual acoustic model yields the lowest WER.
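WER, the metric behind all of these comparisons, is the word-level Levenshtein (edit) distance between the reference transcript and the recognizer's hypothesis, normalized by the number of reference words. A minimal pure-Python sketch (the toy sentences are illustrative):

```python
def wer(ref, hyp):
    """Word Error Rate: minimum substitutions + insertions + deletions
    turning the hypothesis into the reference, over reference length."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))            # 0.0
print(round(wer("the cat sat", "the bat sat"), 3))  # one substitution -> 0.333
```

An "absolute 2% reduction" means the WER percentage drops by two points (e.g. from 10% to 8%), whereas a relative reduction is expressed as a fraction of the baseline WER.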

Conclusion: The multilingual hybrid TDNN-BLSTM-A architecture shows a 13.67% relative improvement over the monolingual Hindi ASR system. The best WER recorded for Hindi ASR was 8.65%. For Marathi and Bengali, the proposed TDNN-BLSTM-A acoustic model achieves best WERs of 30.40% and 10.85%, respectively.
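The two Hindi figures can be cross-checked against each other: a relative improvement of (baseline - best) / baseline = 13.67% ending at 8.65% WER implies a monolingual baseline of about 10.02% WER. That baseline value is inferred from the reported numbers, not stated in the abstract.

```python
best_wer = 8.65          # multilingual TDNN-BLSTM-A Hindi WER, in percent
relative_gain = 0.1367   # reported 13.67% relative improvement

# relative improvement = (baseline - best) / baseline
# => baseline = best / (1 - relative improvement)
baseline = best_wer / (1 - relative_gain)
print(round(baseline, 2))  # 10.02
```

The implied ~10.02% baseline is also consistent with the "absolute 2% WER reduction for Hindi" quoted in the Results, which lends the reported numbers internal coherence.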

Keywords: ASR, DNN, RNN, Hindi speech recognition, multilingual ASR, BLSTM.

