GAN Speech Enhancement Algorithm with Multi-stage Generator and Time-frequency Discriminator
Authors: Chen Yu, Yin Wenbing, Gao Ge, Wang Xiao, Zeng Bang, Chen Yi

    Abstract:

    The traditional speech enhancement generative adversarial network (SEGAN) takes the time-domain speech waveform as its mapping target. Under a low signal-to-noise ratio (SNR), the time-domain waveform is drowned in noise, so the enhancement performance of SEGAN degrades sharply and speech distortion becomes severe. To address this problem, a multi-stage time-frequency SEGAN (MS-TFSEGAN) is proposed for speech enhancement. MS-TFSEGAN adopts a multi-stage generator together with dual time-domain and frequency-domain discriminators, so that the mapping result is refined stage by stage while both time-domain and frequency-domain information is captured. In addition, to further strengthen the model's ability to learn fine frequency-domain details, MS-TFSEGAN introduces a frequency-domain L1 loss into the generator loss function. Experiments show that under low-SNR conditions, MS-TFSEGAN improves speech quality and intelligibility by about 13.32% and 8.97%, respectively, compared with SEGAN, and achieves a 7.3% relative improvement in character error rate (CER) when used as a speech recognition front end.
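
    To make the loss design above concrete, the following is a minimal PyTorch-style sketch of a per-stage generator objective that combines least-squares adversarial terms from a time-domain and a frequency-domain discriminator with time-domain and frequency-domain (STFT-magnitude) L1 losses. It is an illustration under stated assumptions, not the paper's implementation: the function names, loss weights, and STFT settings here are hypothetical.

import torch
import torch.nn.functional as F


def stft_magnitude(wave, n_fft=512, hop_length=128):
    # Magnitude spectrogram fed to the frequency-domain branch
    # (n_fft and hop_length are illustrative, not the paper's settings).
    window = torch.hann_window(n_fft, device=wave.device)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    return spec.abs()


def generator_stage_loss(enhanced, clean, d_time_score, d_freq_score,
                         adv_weight=1.0, time_l1_weight=100.0, freq_l1_weight=100.0):
    # enhanced, clean: waveforms of shape (batch, samples).
    # d_time_score, d_freq_score: outputs of the time-domain and
    # frequency-domain discriminators for the enhanced speech.
    # All loss weights are hypothetical placeholders.

    # Least-squares adversarial terms: push both discriminators toward the "real" label.
    adv = F.mse_loss(d_time_score, torch.ones_like(d_time_score)) + \
          F.mse_loss(d_freq_score, torch.ones_like(d_freq_score))

    # Time-domain L1 between enhanced and clean waveforms, as in SEGAN-style models.
    time_l1 = F.l1_loss(enhanced, clean)

    # Frequency-domain L1 on STFT magnitudes -- the extra term described in the abstract.
    freq_l1 = F.l1_loss(stft_magnitude(enhanced), stft_magnitude(clean))

    return adv_weight * adv + time_l1_weight * time_l1 + freq_l1_weight * freq_l1


# In a multi-stage setup, each generator stage refines the previous stage's output
# and contributes its own loss term, e.g.:
#     x, total = noisy, 0.0
#     for g, d_time, d_freq in stages:   # hypothetical per-stage modules
#         x = g(x)
#         total = total + generator_stage_loss(x, clean, d_time(x), d_freq(stft_magnitude(x)))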

    References
    [1] Boll S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1979, 27(2): 113–120. [doi: 10.1109/TASSP.1979.1163209]
    [2] Li N, Loizou PC. Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction. The Journal of the Acoustical Society of America, 2008, 123(3): 1673–1682. [doi: 10.1121/1.2832617]
    [3] Lim J, Oppenheim A. All-pole modeling of degraded speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1978, 26(3): 197–210. [doi: 10.1109/TASSP.1978.1163086]
    [4] Ephraim Y, Malah D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(6): 1109–1121. [doi: 10.1109/TASSP.1984.1164453]
    [5] Xu Y, Du J, Dai LR, et al. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(1): 7–19. [doi: 10.1109/TASLP.2014.2364452]
    [6] Mamun N, Khorram S, Hansen JHL. Convolutional neural network-based speech enhancement for cochlear implant recipients. Proceedings of the INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association. Graz: ISCA, 2019. 4265–4269.
    [7] Zhao H, Zarar S, Tashev I, et al. Convolutional-recurrent neural networks for speech enhancement. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary: IEEE, 2018. 2401–2405.
    [8] Weninger F, Erdogan H, Watanabe S, et al. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. Proceedings of the 12th International Conference on Latent Variable Analysis and Signal Separation. Liberec: Springer, 2015. 91–99.
    [9] Pandey A, Wang D. A new framework for supervised speech enhancement in the time domain. Proceedings of the INTERSPEECH 2018, 19th Annual Conference of the International Speech Communication Association. Hyderabad: ISCA, 2018. 1136–1140.
    [10] Stoller D, Ewert S, Dixon S. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR). Paris: ISMIR, 2018. 334–340.
    [11] Rethage D, Pons J, Serra X. A wavenet for speech denoising. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary: IEEE, 2018. 5069–5073.
    [12] Kim HY, Yoon JW, Cheon SJ, et al. A multi-resolution approach to GAN-based speech enhancement. Applied Sciences, 2021, 11(2): 721. [doi: 10.3390/app11020721]
    [13] Goodfellow IJ, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal: MIT Press, 2014. 2672–2680.
    [14] Hsu CC, Hwang HT, Wu YC, et al. Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. arXiv: 1704.00849, 2017.
    [15] Saito Y, Takamichi S, Saruwatari H. Statistical parametric speech synthesis incorporating generative adversarial networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(1): 84–96. [doi: 10.1109/TASLP.2017.2761547]
    [16] Pascual S, Bonafonte A, Serrà J. SEGAN: Speech enhancement generative adversarial network. Proceedings of the INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association. Stockholm: ISCA, 2017. 3642–3646.
    [17] Gulrajani I, Ahmed F, Arjovsky M, et al. Improved training of Wasserstein GANs. Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017. 5769–5779.
    [18] Karras T, Aila T, Laine S, et al. Progressive growing of GANs for improved quality, stability, and variation. arXiv: 1710.10196, 2018.
    [19] Karras T, Laine S, Aittala M, et al. Analyzing and improving the image quality of StyleGAN. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020. 8107–8116.
    [20] Phan H, McLoughlin IV, Pham L, et al. Improving GANs for speech enhancement. IEEE Signal Processing Letters, 2020, 27: 1700–1704. [doi: 10.1109/LSP.2020.3025020]
    [21] Yin WB. Research on speech enhancement technology based on generative adversarial networks [Ph.D. Thesis]. Wuhan: Wuhan University, 2021 (in Chinese).
    [22] ITU-T. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation P.862, 2001.
    [23] Jensen J, Taal CH. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(11): 2009–2022. [doi: 10.1109/TASLP.2016.2585878]
    [24] Mirza M, Osindero S. Conditional generative adversarial nets. arXiv: 1411.1784, 2014.
    [25] Mao XD, Li Q, Xie HR, et al. Least squares generative adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, 2017. 2813–2821.
    [26] Quan TM, Nguyen-Duc T, Jeong WK. Compressed sensing MRI reconstruction using a generative adversarial network with a cyclic loss. IEEE Transactions on Medical Imaging, 2018, 37(6): 1488–1497. [doi: 10.1109/TMI.2018.2820120]
    [27] Isola P, Zhu JY, Zhou TH, et al. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu: IEEE, 2017. 5967–5976.
    [28] Pandey A, Wang DL. On adversarial training and loss functions for speech enhancement. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary: IEEE, 2018. 5414–5418.
    [29] Snyder D, Chen GG, Povey D. MUSAN: A music, speech, and noise corpus. arXiv: 1510.08484, 2015.
Cite this article

Chen Y, Yin WB, Gao G, Wang X, Zeng B, Chen Y. GAN speech enhancement algorithm with multi-stage generator and time-frequency discriminator. Computer Systems & Applications, 2022, 31(7): 179–185 (in Chinese).

History
  • Received: 2021-10-14
  • Revised: 2021-11-12
  • Published online: 2022-05-31