GAN Speech Enhancement Algorithm with Multi-stage Generator and Time-frequency Discriminator
Authors: Chen Yu, Yin Wenbing, Gao Ge, Wang Xiao, Zeng Bang, Chen Yi

    Abstract:

    The traditional speech enhancement generative adversarial network (SEGAN) takes the time-domain speech waveform as its mapping target. Under a low signal-to-noise ratio (SNR), the time-domain waveform is drowned in noise, so the enhancement performance of SEGAN degrades sharply and speech distortion becomes severe. To address this problem, a multi-stage time-frequency SEGAN (MS-TFSEGAN) is proposed for speech enhancement. MS-TFSEGAN adopts a multi-stage generator together with dual time-domain and frequency-domain discriminators, so that the mapping result is refined stage by stage while both time-domain and frequency-domain information is captured. In addition, to further strengthen the model's ability to learn fine frequency-domain details, MS-TFSEGAN introduces a frequency-domain L1 loss into the generator loss function. Experiments show that under low-SNR conditions, MS-TFSEGAN improves speech quality and intelligibility by about 13.32% and 8.97%, respectively, compared with SEGAN, and achieves a 7.3% relative improvement in character error rate (CER) when used as a speech recognition front end.
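
    To make the loss design above concrete, the following is a minimal PyTorch-style sketch of a per-stage generator objective that combines least-squares adversarial terms from a time-domain and a frequency-domain discriminator with time-domain and frequency-domain (STFT-magnitude) L1 losses. It is an illustration under stated assumptions, not the paper's implementation: the function names, loss weights, and STFT settings here are hypothetical.

import torch
import torch.nn.functional as F


def stft_magnitude(wave, n_fft=512, hop_length=128):
    # Magnitude spectrogram fed to the frequency-domain branch
    # (n_fft and hop_length are illustrative, not the paper's settings).
    window = torch.hann_window(n_fft, device=wave.device)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    return spec.abs()


def generator_stage_loss(enhanced, clean, d_time_score, d_freq_score,
                         adv_weight=1.0, time_l1_weight=100.0, freq_l1_weight=100.0):
    # enhanced, clean: waveforms of shape (batch, samples).
    # d_time_score, d_freq_score: outputs of the time-domain and
    # frequency-domain discriminators for the enhanced speech.
    # All loss weights are hypothetical placeholders.

    # Least-squares adversarial terms: push both discriminators toward the "real" label.
    adv = F.mse_loss(d_time_score, torch.ones_like(d_time_score)) + \
          F.mse_loss(d_freq_score, torch.ones_like(d_freq_score))

    # Time-domain L1 between enhanced and clean waveforms, as in SEGAN-style models.
    time_l1 = F.l1_loss(enhanced, clean)

    # Frequency-domain L1 on STFT magnitudes -- the extra term described in the abstract.
    freq_l1 = F.l1_loss(stft_magnitude(enhanced), stft_magnitude(clean))

    return adv_weight * adv + time_l1_weight * time_l1 + freq_l1_weight * freq_l1


# In a multi-stage setup, each generator stage refines the previous stage's output
# and contributes its own loss term, e.g.:
#     x, total = noisy, 0.0
#     for g, d_time, d_freq in stages:   # hypothetical per-stage modules
#         x = g(x)
#         total = total + generator_stage_loss(x, clean, d_time(x), d_freq(stft_magnitude(x)))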

    References
    [1] Boll S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1979, 27(2): 113–120. [doi: 10.1109/TASSP.1979.1163209]
    [2] Li N, Loizou PC. Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction. The Journal of the Acoustical Society of America, 2008, 123(3): 1673–1682. [doi: 10.1121/1.2832617]
    [3] Lim J, Oppenheim A. All-pole modeling of degraded speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1978, 26(3): 197–210. [doi: 10.1109/TASSP.1978.1163086]
    [4] Ephraim Y, Malah D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(6): 1109–1121. [doi: 10.1109/TASSP.1984.1164453]
    [5] Xu Y, Du J, Dai LR, et al. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(1): 7–19. [doi: 10.1109/TASLP.2014.2364452]
    [6] Mamun N, Khorram S, Hansen JHL. Convolutional neural network-based speech enhancement for cochlear implant recipients. Proceedings of the INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association. Graz: ISCA, 2019. 4265–4269.
    [7] Zhao H, Zarar S, Tashev I, et al. Convolutional-recurrent neural networks for speech enhancement. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary: IEEE, 2018. 2401–2405.
    [8] Weninger F, Erdogan H, Watanabe S, et al. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. Proceedings of the 12th International Conference on Latent Variable Analysis and Signal Separation. Liberec: Springer, 2015. 91–99.
    [9] Pandey A, Wang D. A new framework for supervised speech enhancement in the time domain. Proceedings of the INTERSPEECH 2018, 19th Annual Conference of the International Speech Communication Association. Hyderabad: ISCA, 2018. 1136–1140.
    [10] Stoller D, Ewert S, Dixon S. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR). Paris: ISMIR, 2018. 334–340.
    [11] Rethage D, Pons J, Serra X. A wavenet for speech denoising. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary: IEEE, 2018. 5069–5073.
    [12] Kim HY, Yoon JW, Cheon SJ, et al. A multi-resolution approach to GAN-based speech enhancement. Applied Sciences, 2021, 11(2): 721. [doi: 10.3390/app11020721]
    [13] Goodfellow IJ, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal: MIT Press, 2014. 2672–2680.
    [14] Hsu CC, Hwang HT, Wu YC, et al. Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. arXiv: 1704.00849, 2017.
    [15] Saito Y, Takamichi S, Saruwatari H. Statistical parametric speech synthesis incorporating generative adversarial networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(1): 84–96. [doi: 10.1109/TASLP.2017.2761547]
    [16] Pascual S, Bonafonte A, Serrà J. SEGAN: Speech enhancement generative adversarial network. Proceedings of the INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association. Stockholm: ISCA, 2017. 3642–3646.
    [17] Gulrajani I, Ahmed F, Arjovsky M, et al. Improved training of Wasserstein GANs. Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017. 5769–5779.
    [18] Karras T, Aila T, Laine S, et al. Progressive growing of GANs for improved quality, stability, and variation. arXiv: 1710.10196, 2018.
    [19] Karras T, Laine S, Aittala M, et al. Analyzing and improving the image quality of StyleGAN. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020. 8107–8116.
    [20] Phan H, McLoughlin IV, Pham L, et al. Improving GANs for speech enhancement. IEEE Signal Processing Letters, 2020, 27: 1700–1704. [doi: 10.1109/LSP.2020.3025020]
    [21] Yin WB. Research on speech enhancement technology based on generative adversarial networks [Ph.D. Thesis]. Wuhan: Wuhan University, 2021 (in Chinese).
    [22] ITU-T. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation P.862, 2001.
    [23] Jensen J, Taal CH. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(11): 2009–2022. [doi: 10.1109/TASLP.2016.2585878]
    [24] Mirza M, Osindero S. Conditional generative adversarial nets. arXiv: 1411.1784, 2014.
    [25] Mao XD, Li Q, Xie HR, et al. Least squares generative adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, 2017. 2813–2821.
    [26] Quan TM, Nguyen-Duc T, Jeong WK. Compressed sensing MRI reconstruction using a generative adversarial network with a cyclic loss. IEEE Transactions on Medical Imaging, 2018, 37(6): 1488–1497. [doi: 10.1109/TMI.2018.2820120]
    [27] Isola P, Zhu JY, Zhou TH, et al. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu: IEEE, 2017. 5967–5976.
    [28] Pandey A, Wang DL. On adversarial training and loss functions for speech enhancement. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary: IEEE, 2018. 5414–5418.
    [29] Snyder D, Chen GG, Povey D. MUSAN: A music, speech, and noise corpus. arXiv: 1510.08484, 2015.
Cite this article

Chen Y, Yin WB, Gao G, Wang X, Zeng B, Chen Y. GAN speech enhancement algorithm with multi-stage generator and time-frequency discriminator. Computer Systems & Applications, 2022, 31(7): 179–185 (in Chinese).

History
  • Received: 2021-10-14
  • Revised: 2021-11-12
  • Published online: 2022-05-31