Received: March 09, 2011 Revised: March 30, 2011
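Chinese abstract (translated): Collecting training corpora for restricted-domain language models consumes considerable manpower and resources, and an insufficiently collected corpus often causes data sparsity. There are two ways to address this problem: (1) apply a data smoothing algorithm to reduce the perplexity of the model; (2) expand the training corpus. This paper explores a semi-automatic method for expanding the training corpus of a language model. The method computes mutual information to partition a large-scale, non-restricted-domain corpus into a number of word classes, producing a large word-class table; the domain-relevant word classes in that table are then extracted, manually pruned, and used to estimate the parameters of the restricted-domain language model. Experiments show that, when applied to a speech recognition system, the method effectively shortens the time needed to collect language model training corpora and improves the system's recognition rate.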
Abstract: Collecting a training corpus for a restricted-domain language model is time-consuming, and an insufficient corpus leads to training data sparsity. Two methods are commonly used to address this problem: reducing the perplexity of the model through data smoothing, and expanding the corpus. This paper proposes a semi-automatic method for expanding the training corpus of a language model. A large list of word classes is generated by computing mutual information over a large-scale, non-restricted-domain corpus. The word classes related to the restricted domain are then extracted, manually pruned, and used to estimate the parameters of the language model. Experimental results show that the method effectively alleviates training data sparsity and improves the recognition rate of a speech recognition system.
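The abstract does not say which mutual-information criterion is used to form the word classes. As an illustration only, the sketch below computes the average mutual information between adjacent word classes, a quantity often maximized in class-based clustering; this choice of criterion is an assumption here, not the authors' stated method. The function name `average_mutual_information` and the `word_to_class` mapping are hypothetical.

```python
import math
from collections import Counter


def average_mutual_information(tokens, word_to_class):
    """Average mutual information (in bits) between adjacent word classes.

    tokens: list of words in corpus order.
    word_to_class: dict mapping each word to a class label (hypothetical input).
    """
    # Replace every word by its class label.
    classes = [word_to_class[w] for w in tokens]

    # Count adjacent class pairs (class bigrams).
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())
    if total == 0:
        return 0.0

    # Marginal counts for the left and right positions of the bigram.
    left = Counter()
    right = Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n

    # Sum p(c1, c2) * log2( p(c1, c2) / (p(c1) * p(c2)) ) over observed pairs.
    ami = 0.0
    for (c1, c2), n in bigrams.items():
        p12 = n / total
        p1 = left[c1] / total
        p2 = right[c2] / total
        ami += p12 * math.log2(p12 / (p1 * p2))
    return ami


if __name__ == "__main__":
    corpus = "a b a b a b".split()
    # Two classes keep the alternating structure; one class discards it.
    print(average_mutual_information(corpus, {"a": 0, "b": 1}))
    print(average_mutual_information(corpus, {"a": 0, "b": 0}))
```

A class assignment that preserves more of the corpus's sequential structure scores higher, so a clustering procedure built on this criterion would prefer merges that lose the least mutual information.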
Citation:
HUANG Yun-Zhu, WEI Wei, LUO Yang-Yu, LI Cheng-Rong. Word-Class Expansion Method About Training Corpus of Language Model in Restricted Domain. Computer Systems Applications, 2011, 20(11): 55-58