Key Technologies of Speech Intelligibility Based on CycleGAN
Authors: Xiao Jing, Liu Jiaqi, Li Dengshi, Zhao Lanxin, Wang Qianrui
Funding: National Key R&D Program of China (1502-211100026)

    Abstract:

    Speech intelligibility enhancement is a perceptual enhancement technique for reproducing clear speech in noisy environments. Many studies enhance speech intelligibility through speaking style conversion (SSC), which relies solely on the Lombard effect and therefore performs poorly under strong noise interference. In addition, SSC models the conversion of the fundamental frequency (F0) with a simple linear transform and maps only a few dimensions of Mel-frequency cepstral coefficients (MFCCs). As F0 and MFCCs are two critical features of speech, adequate modeling of these features is essential. We therefore use the continuous wavelet transform (CWT) to decompose F0 into ten dimensions that describe speech at different time scales, enabling effective F0 conversion, and represent MFCCs with 20 dimensions for MFCC conversion. Furthermore, we employ an iMetricGAN network to optimize speech intelligibility metrics under strong noise. Experimental results show that the proposed non-parallel speech style conversion method based on CycleGAN with CWT and iMetricGAN (NS-CiC) significantly improves speech intelligibility in strong-noise environments in both objective and subjective evaluations.
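The CWT-based F0 decomposition described above can be sketched in a few lines. This is a minimal NumPy-only illustration, not the authors' implementation: it assumes a Mexican hat (Ricker) mother wavelet sampled at ten dyadic scales, the configuration commonly used in CWT prosody modeling (cf. reference [14]); the function names and the base-scale parameter `tau0` are our own.

```python
import numpy as np

def mexican_hat(t):
    # Mexican hat (Ricker) mother wavelet, up to a constant factor.
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def cwt_decompose_f0(log_f0, num_scales=10, tau0=2.0):
    """Decompose an interpolated log-F0 contour into `num_scales`
    components at dyadic time scales with a discretized continuous
    wavelet transform."""
    x = np.asarray(log_f0, dtype=float)
    n = len(x)
    x = x - x.mean()                      # zero-mean, as usual in CWT prosody work
    scales = tau0 * 2.0 ** np.arange(num_scales)
    components = np.empty((num_scales, n))
    for i, s in enumerate(scales):
        # Sample the wavelet over (at most) the signal length so that
        # np.convolve(..., mode="same") returns n samples; very large
        # scales are truncated, which is acceptable for a sketch.
        half = min(int(np.ceil(4 * s)), (n - 1) // 2)
        k = np.arange(-half, half + 1) / s
        psi = mexican_hat(k) / np.sqrt(s)
        components[i] = np.convolve(x, psi, mode="same")
    return components, scales

# Demo: a synthetic voiced F0 contour, 300 frames.
f0 = 120.0 + 20.0 * np.sin(np.linspace(0, 6 * np.pi, 300))
comps, scales = cwt_decompose_f0(np.log(f0))
print(comps.shape)   # (10, 300): one component per time scale
```

In the CWT framework, an approximate F0 contour can be recovered as a weighted sum of these components, which is what makes the ten-dimensional representation usable as a conversion target.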

    References
    [1] Kleijn WB, Crespo JB, Hendriks RC, et al. Optimizing speech intelligibility in a noisy environment: A unified view. IEEE Signal Processing Magazine, 2015, 32(2): 43–54. [doi: 10.1109/MSP.2014.2365594]
    [2] Taal CH, Hendriks RC, Heusdens R. Speech energy redistribution for intelligibility improvement in noise based on a perceptual distortion measure. Computer Speech & Language, 2014, 28(4): 858–872.
    [3] Licklider JCR, Pollack I. Effects of differentiation, integration, and infinite peak clipping upon the intelligibility of speech. The Journal of the Acoustical Society of America, 1948, 20(1): 42–51. [doi: 10.1121/1.1906346]
    [4] Arai T, Hodoshima N, Yasu K. Using steady-state suppression to improve speech intelligibility in reverberant environments for elderly listeners. IEEE Transactions on Audio, Speech, and Language Processing, 2010, 18(7): 1775–1780. [doi: 10.1109/TASL.2010.2052165]
    [5] Kusumoto A, Arai T, Kinoshita K, et al. Modulation enhancement of speech by a pre-processing algorithm for improving intelligibility in reverberant environments. Speech Communication, 2005, 45(2): 101–113. [doi: 10.1016/j.specom.2004.06.003]
    [6] Aubanel V, Cooke M. Information-preserving temporal reallocation of speech in the presence of fluctuating maskers. Proceedings of the 14th Annual Conference of the International Speech Communication Association. Lyon: ISCA, 2013. 3592–3596.
    [7] Paul D, Shifas MPV, Pantazis Y, et al. Enhancing speech intelligibility in text-to-speech synthesis using speaking style conversion. Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai: ISCA, 2020. 1361–1365.
    [8] Garnier M, Henrich N. Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise? Computer Speech & Language, 2014, 28(2): 580–597.
    [9] Morise M, Yokomori F, Ozawa K. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 2016, E99-D(7): 1877–1884. [doi: 10.1587/transinf.2015EDP7457]
    [10] Kawanami H, Iwami Y, Toda T, et al. GMM-based voice conversion applied to emotional speech synthesis. Proceedings of the 8th European Conference on Speech Communication and Technology. Geneva: ISCA, 2003. 208–211.
    [11] Seshadri S, Juvela L, Räsänen O, et al. Vocal effort based speaking style conversion using vocoder features and parallel learning. IEEE Access, 2019, 7: 17230–17246. [doi: 10.1109/ACCESS.2019.2895923]
    [12] Ming HP, Huang DY, Xie L, et al. Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion. Proceedings of the 17th Annual Conference of the International Speech Communication Association. San Francisco: ISCA, 2016. 2453–2457.
    [13] Seshadri S, Juvela L, Yamagishi J, et al. Cycle-consistent adversarial networks for non-parallel vocal effort based speaking style conversion. Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton: IEEE, 2019. 6835–6839.
    [14] Ribeiro MS, Clark RAJ. A multi-level representation of F0 using the continuous wavelet transform and the discrete cosine transform. Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. South Brisbane: IEEE, 2015. 4909–4913.
    [15] Li HY, Fu SW, Tsao Y, et al. iMetricGAN: Intelligibility enhancement for speech-in-noise using generative adversarial network-based metric learning. Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai: ISCA, 2020. 1336–1340.
    [16] Kruschke H, Lenz M. Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis. Proceedings of the 8th European Conference on Speech Communication and Technology. Geneva: ISCA, 2003. 2881–2884.
    [17] Mishra T, Van Santen J, Klabbers E. Decomposition of pitch curves in the general superpositional intonation model. Proceedings of the 3rd International Conference on Speech Prosody 2006. Dresden: ISCA, 2006.
    [18] Sisman B, Li HZ. Wavelet analysis of speaker dependent and independent prosody for voice conversion. Proceedings of the 19th Annual Conference of the International Speech Communication Association. Hyderabad: ISCA, 2018. 52–56.
    [19] van Kuyk S, Kleijn WB, Hendriks RC. An evaluation of intrusive instrumental intelligibility metrics. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(11): 2153–2166. [doi: 10.1109/TASLP.2018.2856374]
    [20] Alghamdi A, Chan WY. Modified ESTOI for improving speech intelligibility prediction. Proceedings of 2020 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). London: IEEE, 2020. 1–5.
    [21] Alghamdi N, Maddock S, Marxer R, et al. A corpus of audio-visual Lombard speech with frontal and profile views. The Journal of the Acoustical Society of America, 2018, 143(6): EL523–EL529. [doi: 10.1121/1.5042758]
    [22] Soloducha M, Raake A, Kettler F, et al. Lombard speech database for German language. Proceedings of the 42nd Annual Conference on Acoustics. Aachen, 2016.
    [23] Varga A, Steeneken HJM. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 1993, 12(3): 247–251.
    [24] Schädler MR. Optimization and evaluation of an intelligibility-improving signal processing approach (IISPA) for the Hurricane Challenge 2.0 with FADE. Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai: ISCA, 2020. 1331–1335.
    [25] Chai T, Draxler RR. Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 2014, 7(3): 1247–1250. [doi: 10.5194/gmd-7-1247-2014]
    [26] ITU-T Recommendation P.800: Methods for subjective determination of transmission quality. Geneva: ITU, 1996.
Cite this article

Xiao J, Liu JQ, Li DS, Zhao LX, Wang QR. Key technologies of speech intelligibility based on CycleGAN. Computer Systems & Applications, 2022, 31(6): 1–9.

History
  • Received: 2021-09-14
  • Revised: 2021-10-14
  • Published online: 2022-05-26
Copyright: Institute of Software, Chinese Academy of Sciences