Received: October 09, 2019    Revised: November 04, 2019
Chinese Abstract (translated): In Chinese natural language processing, word segmentation is one of the most important steps. It is the basis for keyword extraction, automatic text summarization, and text clustering, and the quality of the segmentation result directly affects the accuracy of further text processing. In recent years, with the rise of free public-opinion platforms such as Weibo, live-streaming platforms, and WeChat Moments, large volumes of loosely written text, and especially the constant appearance of new words, have posed a serious challenge to segmentation accuracy, making new word detection a problem that segmentation algorithms must solve. To address the small overall volume of new-word data, the flexible usage of new words, and the tendency of over-merging to produce phrase blocks, this paper proposes a new word detection algorithm that combines correlation confidence with Jieba word segmentation. Starting from Jieba's preliminary segmentation result, the algorithm computes the correlation confidence between each word and every word in its left and right neighbor sets, merges incorrectly split words into candidate new words, and splits conjunctions to prevent multiple words from being joined into phrases. Experiments on Weibo comment data show that, compared with other confidence-based segmentation methods, the proposed algorithm greatly improves the accuracy of detecting new words, especially named entities and internet slang, reduces new-word length while keeping new words semantically complete, and can still recognize low-frequency new words with only a small test corpus.
Abstract: Word segmentation is one of the most important steps in Chinese natural language processing. It is the basis for keyword extraction, automatic text summarization, and text clustering, and the quality of the segmentation directly affects the accuracy of further text processing. In recent years, with the rise of free public-opinion platforms such as microblogs, live-streaming platforms, and WeChat Moments, a large number of new words have brought great challenges to word segmentation methods. New word discovery must cope with the small overall volume of new words, their flexible usage, and the tendency of excessive merging to form phrase blocks. This study proposes a new word detection algorithm combining correlation confidence and Jieba word segmentation. The algorithm starts from the preliminary segmentation produced by the Python Jieba library, calculates the correlation confidence between adjacent words to merge incorrectly split words into candidate new words, and splits conjunctions to prevent multiple words from being connected into phrases. In experiments on Weibo data, compared with other confidence-based word segmentation methods, the proposed algorithm greatly improves the accuracy of discovering new words, especially named entities and network terms, and reduces the length of new words while preserving their semantic integrity. Even with a small test corpus, the proposed algorithm can still recognize low-frequency new words.
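The abstract describes a two-stage pipeline: first merge adjacent words whose correlation confidence is high into candidate new words, then split on conjunctions so that whole phrases are not glued together. The following is a minimal Python sketch of that pipeline, not the paper's implementation: the abstract does not give the correlation-confidence formula, so the sketch substitutes a simple conditional co-occurrence ratio max(P(b|a), P(a|b)), and the conjunction list and the 0.6 threshold are hypothetical placeholders.

```python
# Minimal sketch of the pipeline in the abstract, NOT the paper's exact method.
# "Correlation confidence" is approximated here by max(P(b|a), P(a|b)); the
# conjunction list and the merge threshold are hypothetical placeholders.
from collections import Counter

import jieba

CONJUNCTIONS = {"的", "和", "与", "了"}  # hypothetical conjunction/particle list

def segment_corpus(lines):
    """Run Jieba's default segmentation over each line of the corpus."""
    return [list(jieba.cut(line)) for line in lines]

def association_confidence(sentences):
    """Count unigrams and adjacent bigrams, then score each bigram (a, b)
    by max(count(a,b)/count(a), count(a,b)/count(b))."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return {(a, b): max(c / uni[a], c / uni[b]) for (a, b), c in bi.items()}

def merge_candidates(sentences, scores, threshold=0.6):
    """Merge adjacent pairs whose score clears the threshold into candidate
    new words, skipping conjunctions so phrases are not glued together."""
    candidates = Counter()
    for words in sentences:
        for a, b in zip(words, words[1:]):
            if a in CONJUNCTIONS or b in CONJUNCTIONS:
                continue  # splitting on conjunctions prevents phrase blocks
            if scores.get((a, b), 0.0) >= threshold:
                candidates[a + b] += 1
    return candidates

if __name__ == "__main__":
    corpus = ["小明在B站看直播", "小明喜欢B站的直播"]  # toy microblog-like lines
    sents = segment_corpus(corpus)
    print(merge_candidates(sents, association_confidence(sents)).most_common(5))
```

On real Weibo text, the scoring function and threshold would follow the paper's definition of correlation confidence over each word's left and right neighbor sets; this sketch only illustrates the control flow.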
keywords: natural language processing; word segmentation; confidence; new word detection; named entities
Author Name | Affiliation | Email
CAO Shuai | College of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China | s17070785@s.upc.edu.cn
Citation:
曹帅. 结合关联置信度与结巴分词的新词发现算法. 计算机系统应用, 2020, 29(5): 144-151
CAO Shuai. New Word Detection Algorithm Combining Correlation Confidence and Jieba Word Segmentation. Computer Systems & Applications, 2020, 29(5): 144-151