Received:May 28, 2018 Revised:June 19, 2018
Abstract: NLTK is a third-party Python module for natural language processing, but it has limitations when processing Chinese text. This study uses NLTK to extract and mine the information content of Chinese texts, combining common-context word extraction, bigram collocation extraction, probability statistics, and discourse analysis. The result is an NLTK-based text content extraction framework suited to Chinese texts, together with a concrete implementation method. Empirical analysis shows that the extraction results contain corpus content reflecting the characteristics of the text, and leads to the conclusion that the extraction results are strongly correlated with the text topic.
Keywords: natural language processing; Chinese texts; NLTK
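Two of the methods named in the abstract, probability statistics and bigram collocation extraction, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the Chinese text has already been segmented into a list of word tokens by an external tool (NLTK's default tokenizers do not segment Chinese), it ranks bigrams by raw frequency as one possible scoring choice, and the helper names `top_words` and `top_bigrams` as well as the sample tokens are hypothetical.

```python
from nltk import FreqDist
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder


def top_words(tokens, n=5):
    """Return the n most frequent tokens with their counts."""
    return FreqDist(tokens).most_common(n)


def top_bigrams(tokens, n=5, min_freq=2):
    """Return up to n adjacent word pairs, ranked by raw frequency.

    Pairs occurring fewer than min_freq times are discarded first.
    """
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(min_freq)
    return finder.nbest(BigramAssocMeasures.raw_freq, n)


# Hypothetical, pre-segmented sample text (segmentation is assumed to
# have been done beforehand by a separate Chinese word segmenter).
sample_tokens = [
    "自然", "语言", "处理", "是", "人工", "智能", "的", "分支",
    "自然", "语言", "处理", "依赖", "语料",
    "自然", "语言", "理解", "依赖", "语料",
]

if __name__ == "__main__":
    print(top_words(sample_tokens, 3))
    print(top_bigrams(sample_tokens, 3))
```

Raw frequency favors pairs that are simply common. Other measures on `BigramAssocMeasures`, such as `pmi` or `likelihood_ratio`, can be passed to `nbest` in its place.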
Citation:
LI Chen, LIU Wei-Guo. Chinese Text Information Extraction Based on NLTK. Computer Systems Applications, 2019, 28(1): 275-278