Abstract: Collecting a training corpus for a language model in a restricted domain is time-consuming, and an insufficient corpus leads to training data sparsity. There are two common ways to address this problem: reducing the complexity of the model through data smoothing, or expanding the corpus. This paper proposes a semi-automatic method for expanding the training corpus of a language model. A large list of word classes is first generated by computing mutual information over a large-scale open-domain corpus. The word classes related to the restricted domain are then extracted and manually pruned, and the result is used to estimate the parameters of the language model. Experimental results show that the method can effectively address training data sparsity and improve the recognition rate of a speech recognition system.
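
As an illustration of the mutual-information step, the sketch below computes the average mutual information between the classes of adjacent words, the quantity that mutual-information-based word clustering typically tries to keep high when merging words into classes. The abstract does not give the exact criterion or clustering procedure, so the function name `average_mutual_information`, the `word_to_class` mapping, and the toy token sequence are all assumptions made for this example.

```python
import math
from collections import Counter


def average_mutual_information(tokens, word_to_class):
    """Average mutual information (in bits) between the classes of adjacent tokens.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    classes = [word_to_class[w] for w in tokens]
    pairs = list(zip(classes, classes[1:]))  # adjacent class bigrams
    total = len(pairs)
    if total == 0:
        return 0.0

    pair_counts = Counter(pairs)
    left_counts = Counter(c1 for c1, _ in pairs)   # class in first position
    right_counts = Counter(c2 for _, c2 in pairs)  # class in second position

    ami = 0.0
    for (c1, c2), n in pair_counts.items():
        p_joint = n / total
        p_left = left_counts[c1] / total
        p_right = right_counts[c2] / total
        ami += p_joint * math.log2(p_joint / (p_left * p_right))
    return ami


# Toy data (assumed): two words that strictly alternate.
tokens = "a b a b a b a".split()
separate = {"a": "C1", "b": "C2"}  # each word in its own class
merged = {"a": "C1", "b": "C1"}    # both words in one class

ami_separate = average_mutual_information(tokens, separate)
ami_merged = average_mutual_information(tokens, merged)
```

Merging the two words into one class drops the average mutual information to zero here, because the class sequence no longer carries any information about what follows. A greedy clustering procedure would prefer merges that lose as little of this quantity as possible.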