COMPUTER SYSTEMS APPLICATIONS, 2023, 32(4): 300-307

Sentence Level Text-speech Alignment for Imperfect Transcriptions

XU Kai, TAO Ye, LI Hui

(School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China)

Received: September 7, 2022; Revised: October 21, 2022

Abstract: Automatic text-speech alignment is widely used in speech recognition, speech synthesis, and content production. It aligns speech with its reference text at the sentence, word, or phoneme level and yields the timing of each text unit in the audio. Most state-of-the-art alignment methods are built on automatic speech recognition (ASR), which has two drawbacks. First, alignment accuracy is bounded by recognition quality and drops markedly when the word error rate (WER) is high. Second, such methods cannot effectively align long recordings with imperfect transcriptions, where the speech and the text only partially match. This study proposes a text-speech alignment method based on anchors and prosodic information. Segment annotation based on boundary anchor weighting divides the corpus into aligned and unaligned segments; for the unaligned segments, a dual-threshold endpoint detection method extracts prosodic information and detects sentence boundaries. This reduces the dependence of ASR-based alignment on recognition quality. Experiments show that, compared with current advanced ASR-based text-speech alignment methods, the proposed method improves alignment accuracy by more than 45% even when the WER is 0.52, and by at least 3% when the degree of audio-text mismatch is 0.5.
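The abstract names dual-threshold endpoint detection but gives no parameters, so the following is only a minimal sketch of the general technique, not the authors' implementation. It assumes an energy-only variant (the classic formulation often adds a zero-crossing-rate stage), 25 ms frames with a 10 ms hop at 16 kHz, hand-picked thresholds, and a hypothetical rule that a pause of at least 0.2 s is a candidate sentence boundary. All function names are illustrative.

```python
import math

SAMPLE_RATE = 16000  # assumed sampling rate (Hz)
FRAME_LEN = 400      # assumed frame length: 25 ms
HOP = 160            # assumed hop size: 10 ms


def frame_energies(samples, frame_len=FRAME_LEN, hop=HOP):
    """Short-time energy (sum of squared samples) of each frame."""
    last_start = max(len(samples) - frame_len + 1, 1)
    return [sum(x * x for x in samples[s:s + frame_len])
            for s in range(0, last_start, hop)]


def dual_threshold_segments(energies, high, low):
    """Return (first_frame, last_frame) pairs judged to be speech.

    A segment is seeded by a frame whose energy exceeds `high`, then
    extended in both directions while energy stays above `low`.
    """
    segments = []
    i, n = 0, len(energies)
    while i < n:
        if energies[i] > high:
            start = i
            while start > 0 and energies[start - 1] > low:
                start -= 1
            end = i
            while end + 1 < n and energies[end + 1] > low:
                end += 1
            segments.append((start, end))
            i = end + 1
        else:
            i += 1
    return segments


def long_pauses(segments, min_pause=0.2, hop=HOP, sample_rate=SAMPLE_RATE):
    """Gaps between speech segments lasting at least `min_pause` seconds.

    Treating such gaps as candidate sentence boundaries is an assumption
    made for this sketch.
    """
    pauses = []
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        start_s = (prev_end + 1) * hop / sample_rate
        end_s = next_start * hop / sample_rate
        if end_s - start_s >= min_pause:
            pauses.append((start_s, end_s))
    return pauses


def tone(seconds, freq=220.0, amp=0.5):
    """Synthetic stand-in for speech: a plain sine tone."""
    n = round(seconds * SAMPLE_RATE)
    return [amp * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
            for i in range(n)]


# Two 0.5 s "utterances" separated by 0.3 s of digital silence.
signal = tone(0.5) + [0.0] * round(0.3 * SAMPLE_RATE) + tone(0.5)
energies = frame_energies(signal)
segments = dual_threshold_segments(energies, high=10.0, low=1.0)
pauses = long_pauses(segments)
print(segments)
print(pauses)
```

The high threshold keeps weak noise from starting a segment, while the low threshold lets a confirmed segment absorb its quieter onset and offset frames. On real recordings both thresholds would typically be derived from an estimate of the background noise level rather than fixed by hand.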

Keywords: text-speech alignment; prosodic information; anchor; automatic speech recognition (ASR); endpoint detection

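Funding: National Key Research and Development Program of China (2018YFB1702902); Youth Innovation Science and Technology Support Program of Shandong Provincial Higher Education Institutions (2019KJN047)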

Author Name	Affiliation	E-mail
XU Kai	School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China
TAO Ye	School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China	ye.tao@qust.edu.cn
LI Hui	School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China

Cite this article:
XU Kai, TAO Ye, LI Hui. Sentence Level Text-speech Alignment for Imperfect Transcriptions. COMPUTER SYSTEMS APPLICATIONS, 2023, 32(4): 300-307