Overview on Speech Synthesis, Forgery and Detection Technology
语音合成及伪造、鉴伪技术综述

Authors: Yang Shuai (杨帅), Qiao Kai (乔凯), Chen Jian (陈健), Wang Linyuan (王林元), Yan Bin (闫镔)
Abstract:

In recent years, with the rise of intelligent mobile devices, people come into contact with and use voice information more and more frequently, and voice forgery and its detection have become increasingly important technologies in the field of voice processing. This survey first outlines the general pipeline of a speech synthesis system and systematically reviews the two main techniques in the field of voice forgery, namely text-to-speech (TTS) and voice conversion (VC). It then introduces and categorizes the common algorithms used in voice forgery detection. Finally, in view of the existing problems in voice forgery and its detection, it puts forward possible development directions from the perspectives of data, models, training methods, and application scenarios.
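
As an illustration of the general speech synthesis pipeline referred to above (text front end, acoustic model, and vocoder), the following minimal Python sketch shows how text flows through the two stages to a waveform. The function names and the toy stand-in models (text_to_ids, acoustic_model, vocoder) are hypothetical placeholders used only for this overview, not code from any of the surveyed systems.

    # Minimal sketch of a two-stage TTS pipeline: text -> symbol IDs ->
    # acoustic features (e.g., a mel-spectrogram) -> waveform via a vocoder.
    # All names and the toy "models" below are hypothetical placeholders.
    import numpy as np

    def text_to_ids(text, vocab="abcdefghijklmnopqrstuvwxyz '"):
        # Front end: normalize text and map each symbol to an integer ID.
        return np.array([vocab.index(c) for c in text.lower() if c in vocab])

    def acoustic_model(ids, n_mels=80, frames_per_symbol=5):
        # Stand-in for a neural acoustic model (e.g., Tacotron-style),
        # which would predict several mel-spectrogram frames per input symbol.
        rng = np.random.default_rng(0)
        return rng.standard_normal((len(ids) * frames_per_symbol, n_mels))

    def vocoder(mel, hop_length=256):
        # Stand-in for a neural vocoder (e.g., WaveNet/WaveGlow-style),
        # which would convert acoustic frames into raw audio samples.
        rng = np.random.default_rng(1)
        return rng.standard_normal(mel.shape[0] * hop_length).astype(np.float32)

    if __name__ == "__main__":
        ids = text_to_ids("speech synthesis")
        mel = acoustic_model(ids)    # stage 1: text -> acoustic features
        audio = vocoder(mel)         # stage 2: acoustic features -> waveform
        print(ids.shape, mel.shape, audio.shape)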

Cite this article:

杨帅, 乔凯, 陈健, 王林元, 闫镔. 语音合成及伪造、鉴伪技术综述. 计算机系统应用, 2022, 31(7): 12-22
(Yang S, Qiao K, Chen J, Wang LY, Yan B. Overview on speech synthesis, forgery and detection technology. Computer Systems & Applications, 2022, 31(7): 12-22)
History
  • Received: 2021-10-08
  • Revised: 2021-11-08
  • Published online: 2022-05-31