Survey on Speech Recognition

Authors: 马晗, 唐柔冰, 张义, 张巧灵

Funding: National Natural Science Foundation of China (61806178); Natural Science Foundation of Zhejiang Province (LY21F010015)
    Abstract:

    Speech recognition makes the voice "readable", enabling computers to understand and respond to human language; it is one of the key technologies for human-computer interaction in artificial intelligence. This study introduces the development of speech recognition, expounds its principles, concepts, and basic framework, and analyzes the research hotspots and difficulties in the field. Finally, it summarizes speech recognition technologies and presents an outlook on future research.
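
    As background for the "basic framework" the abstract refers to, the classical statistical formulation of speech recognition can be sketched as follows; this is a standard textbook identity, not notation taken from this paper. Given an acoustic observation sequence $X$, the recognizer searches for the word sequence $W^*$ that maximizes the posterior probability:

    $$
    W^* = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\,P(W),
    $$

    where $P(X \mid W)$ is the acoustic model, $P(W)$ is the language model, and $P(X)$ can be dropped from the maximization because it does not depend on $W$. Hybrid systems model the two factors separately, whereas end-to-end models learn $P(W \mid X)$ directly.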

Cite this article:

马晗, 唐柔冰, 张义, 张巧灵. Survey on speech recognition. 计算机系统应用 (Computer Systems & Applications), 2022, 31(1): 1–10.

History:
  • Received: 2021-04-20
  • Revised: 2021-05-19
  • Published online: 2021-12-17