End-to-end Speech Recognition Based on Conformer-SE
Authors: Ma Yongjie (马永杰), Li Gang (李罡)
    Abstract:

    The end-to-end Transformer model, built on the self-attention mechanism, performs well in speech recognition. However, it is limited in capturing local feature information in its shallow layers, and it does not account for the interdependence between the outputs of different blocks. To address these issues, this study proposes Conformer-SE, an improved end-to-end speech recognition model. The model first replaces the encoder of the Transformer with the Conformer structure, strengthening its ability to extract local features. It then introduces the SE channel attention mechanism to integrate the output of every block into the final output as a weighted sum. Experiments on the Aishell-1 dataset show that Conformer-SE reduces the character error rate by 18.18% compared with the original Transformer model.
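    The abstract describes the fusion step but gives no implementation details. The sketch below shows, under stated assumptions, how SE-style channel attention could combine the per-block encoder outputs into a weighted sum: the N block outputs are treated as N "channels", squeezed by global average pooling, and excited through a small bottleneck MLP with a sigmoid gate. The module and parameter names here (SEBlockFusion, reduction) are illustrative assumptions, not the authors' code.

```python
# A minimal PyTorch sketch of SE-based fusion of encoder block outputs.
# Assumption: each of the N encoder blocks emits a (batch, time, dim) tensor.
import torch
import torch.nn as nn


class SEBlockFusion(nn.Module):
    """Fuse N per-block outputs with squeeze-and-excitation attention.

    Squeeze: global average pooling reduces each block output to a scalar.
    Excitation: a bottleneck MLP with a sigmoid produces one gate per block.
    The gated block outputs are then summed into a single representation.
    """

    def __init__(self, num_blocks: int, reduction: int = 2):
        super().__init__()
        hidden = max(num_blocks // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(num_blocks, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_blocks),
            nn.Sigmoid(),
        )

    def forward(self, block_outputs: list[torch.Tensor]) -> torch.Tensor:
        x = torch.stack(block_outputs, dim=1)   # (B, N, T, D)
        squeezed = x.mean(dim=(2, 3))           # (B, N): squeeze step
        weights = self.fc(squeezed)             # (B, N): per-block gates
        # Broadcast the gates over time and feature dims, then sum blocks.
        return (x * weights[:, :, None, None]).sum(dim=1)  # (B, T, D)


# Usage: fuse the outputs of, say, 12 Conformer encoder blocks.
fusion = SEBlockFusion(num_blocks=12)
outs = [torch.randn(4, 100, 256) for _ in range(12)]
print(fusion(outs).shape)  # torch.Size([4, 100, 256])
```

    This mirrors the original squeeze-and-excitation design: the squeeze step summarizes each block's output as a single statistic, and the excitation step lets the network learn how much each block contributes to the final representation instead of using only the last block's output.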

Get Citation

Ma YJ, Li G. End-to-end speech recognition based on Conformer-SE. Computer Systems & Applications, 2024, 33(12): 106–114.

History
  • Received: May 28, 2024
  • Revised: June 26, 2024
  • Online: October 31, 2024