计算机系统应用  2018, Vol. 27 Issue (10): 33-38 PDF
N-gram模型综述

Survey on N-gram Model
YIN Chen, WU Min
School of Software Engineering, University of Science and Technology of China, Hefei 230051, China
Abstract: The N-gram model is one of the most commonly used language models in natural language processing and is widely used in many tasks such as speech recognition, handwriting recognition, spelling correction, machine translation and search engines. However, the N-gram model often presents zero-probability problems in training and application, resulting in failure to obtain a good language model. As a result, smoothing methods such as Laplace smoothing, Katz back-off, and Kneser-Ney smoothing appeared. After introducing the basic principles of these smoothing methods, we use the perplexity as a metric to compare the language models trained based on these types of smoothing methods.
Key words: N-gram model     Laplace smoothing     Katz back-off     Kneser-Ney smoothing     perplexity

S1: I am watching movie now.

S2: watching I am now movie.

N-gram模型会预测句子S1的概率 $p(S1)$ 大于句子S2的概率 $p(S2)$ .

S3: Their are so many people.

S4: There are so many people.

S5: 我今天要去上学.

S6: I today want go on learn.

S7: I today want go to school.

S8: I am going to school today.

N-gram模型常用于估算给定语句在语料库中出现的概率, 一个N-gram是一个长为N的词序列. 当N=1时称为Unigram模型即一元模型, 也叫上下文无关模型; 当N=2时称为bigram模型即二元模型; 当N=3时称为trigram模型即三元模型.

1 N-gram模型

N-gram模型的基本原理是基于马尔可夫假设, 在训练N-gram模型时使用最大似然估计模型参数——条件概率.

1.1 基本原理

 $p(S) = p({w_1}{w_2}{w_3}\cdots{w_n}) = p(w_1^n)$ (1)

 $\begin{array}{l} p(w_1^n) = p({w_1})p({w_2}|{w_1})p({w_3}|w_1^2) \cdots ({w_n}|w_1^{n - 1})\\ = \prod\nolimits_{k = 1}^n {p({w_k}|w_1^{k - 1})} \end{array}$ (2)

 $p({w_n}|w_1^{n - 1}) \approx p({w_n}|{w_{n - 1}})$ (3)

 \begin{aligned}p(w_1^n) & \approx p({w_1})p({w_2}|{w_1})p({w_3}|{w_2}) \cdots p({w_n}|{w_{n - 1}})\\& = \prod\nolimits_{k = 1}^n {p({w_k}|{w_{k - 1}})} \end{aligned} (4)

 $p({w_n}|w_1^{n - 1}) \approx p({w_n}|w_{n - N + 1}^{n - 1})$ (5)

1.2 参数估计

 $p({w_i}|{w_{i - 1}}) = \frac{{p({w_{i - 1}}, {w_i})}}{{p({w_i})}} = \frac{{c({w_{i - 1}}, {w_i})}}{{c({w_i})}}$ (6)

<s> I am Sam </s>

<s> Sam I am </s>

<s> I do not like green eggs and ham </s>

 $p(I| < s > ) = \frac{2}{3} = 0.67$
 $p(Sam| < s > ) = \frac{1}{3} = 0.33$
 $p(am|I) = \frac{2}{3} = 0.67$

S9: I will run for the governor.

4 总结

N-gram模型在目前的自然语言处理任务中使用最为广泛的统计语言模型, 它的显著缺点是无法避免的零概率问题, 有时可以通过增大语料库规模来解决, 但是自然语言是一种动态的语言, 所以从概率计算的方法这一角度来解决零概率问题是在目前是比较科学的. 经过实验对于规模较小(语料库规模在3万以内)的语言处理任务, 卡茨回退平滑方法效果就已经很明显, 在接下来的研究中, 我们将重点放在优化与综合利用卡茨回退和Kneser-Ney平滑方法, 利用N-gram模型构建一个提取文章潜在语义的英语作文批改系统.

 [1] Jurafsky D, Martin JH. Speech and Language Processing. Pearson International Edition, 2014, 9. [2] Doddington G. Automatic evaluation of machine translation quality using N-gram co-occurrence statics. HLT ’02 Proceeding of the Second International Conference on Human Language Technology Research. 2002. 138–145. [3] Lin CY, Hovy E. Automatic evaluation of summaries using N-gram co-occurrence statics. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (NAACL’03). 2003. 71–78. [4] Brown PF, deSouza PV, Mercer RL, et al. Computational Linguistics. Cambridge, MA, USA: MIT Press, 1992, 18(4): 467–479. https://dl.acm.org/citation.cfm?id=176313. [5] Cavnar WB, Trenkle JM. N-gram-based text categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. 1994. 161–175. [6] 吴军. 数学之美. 北京: 人民邮电出版社, 2012. 28–39. [7] Bird S, Klein E, Loper E. Natural Language Processing with Python. 陈涛, 张旭, 崔杨, 刘海平, 译. 北京: 人民邮电出版社, 2014: 195–197, 221–228. [8] https://en.wikipedia.org/wiki/N-gram. [9] https://en.wikipedia.org/wiki/Katz%27s_back-off_model [10] 王晓龙, 关毅. 计算机自然语言处理.北京: 清华大学出版社, 2005: 56.