深度复数轴向自注意力卷积循环网络的语音增强
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

基金项目:

甘肃省重点研发计划(22YF7GA130)


Speech Enhancement Based on Deep Complex Axial Self-attention Convolutional RecurrentNetwork
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    单通道语音增强任务中相位估计不准确会导致增强语音的质量较差, 针对这一问题, 提出了一种基于深度复数轴向自注意力卷积循环网络(deep complex axial self-attention convolutional recurrent network, DCACRN)的语音增强方法, 在复数域同时实现了语音幅度信息和相位信息的增强. 首先使用基于复数卷积网络的编码器从输入语音信号中提取复数表示的特征, 并引入卷积跳连模块用以将特征映射到高维空间进行特征融合, 加强信息间的交互和梯度的流动. 然后设计了基于轴向自注意力机制的编码器-解码器结构, 利用轴向自注意力机制来增强模型的时序建模能力和特征提取能力. 最后通过解码器实现对语音信号的重构, 同时利用混合损失函数优化网络模型, 提升增强语音信号的质量. 实验在公开数据集Valentini和DNS Challenge上进行, 结果表明所提方法相对于其他模型在客观语音质量评估(perceptual evaluation of speech quality, PESQ)和短时客观可懂度(short-time objective intelligibility, STOI)两项指标上均有提升, 在非混响数据集中, PESQ比DCTCRN (deep cosine transform convolutional recurrent network)提高了12.8%, 比DCCRN (deep complex convolutional recurrent network)提高了3.9%, 验证了该网络模型在语音增强任务中的有效性.

    Abstract:

    Inaccurate phase estimation in single-channel speech enhancement tasks will cause poor quality of the enhanced speech. To this end, this study proposes a speech enhancement method based on a deep complex axial self-attention convolutional recurrent network (DCACRN), which enhances speech amplitude information and phase information in the complex domain simultaneously. Firstly, a complex convolutional network-based encoder is employed to extract complex features from the input speech signal, and a convolutional hopping module is introduced to map the features into a high-dimensional space for feature fusion, which enhances the information interaction and the gradient flow. Then an encoder-decoder structure based on the axial self-attention mechanism is designed to enhance the model’s timing modeling ability and feature extraction ability. Finally, the reconstruction of the speech signals is realized by the decoder, while the hybrid loss function is adopted to optimize the network model to improve the quality of enhanced speech signals. Meanwhile, the mixed loss function is utilized to optimize the network model and improve the quality of enhanced speech signals. The experiments are conducted on the public datasets Valentini and DNS Challenge, and the results show that the proposed method improves both the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) metrics compared to other models. In the non-reverberant dataset, PESQ is improved by 12.8% over DCTCRN and 3.9% over DCCRN, which validates the effectiveness of the proposed model in speech enhancement tasks.

    参考文献
    相似文献
    引证文献
引用本文

曹洁,王乔,梁浩鹏,王宸章,李晓旭,于泓.深度复数轴向自注意力卷积循环网络的语音增强.计算机系统应用,2024,33(4):60-68

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2023-10-07
  • 最后修改日期:2023-11-09
  • 录用日期:
  • 在线发布日期: 2024-01-18
  • 出版日期:
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京海淀区中关村南四街4号 中科院软件园区 7号楼305房间,邮政编码:100190
电话:010-62661041 传真: Email:csa (a) iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号