New Word Detection Algorithm Combining Correlation Confidence and Jieba Word Segmentation
Author: Cao Shuai (曹帅)

    Abstract:

    Word segmentation is one of the most important steps in Chinese natural language processing: it underpins keyword extraction, automatic text summarization, and text clustering, and the quality of the segmentation directly affects the accuracy of all further text processing. In recent years, with the rise of free-form public opinion platforms such as Weibo, live-streaming platforms, and WeChat Moments, large volumes of informally written text, and in particular a constant stream of new words, have posed a serious challenge to segmentation accuracy, making new word detection a problem that segmentation algorithms must solve. To address the small overall volume of data on new words, their flexible usage, and the tendency of excessive merging to produce phrase blocks, this study proposes a new word detection algorithm that combines correlation confidence with Jieba word segmentation. Starting from the preliminary segmentation produced by the Python Jieba library, the algorithm computes the correlation confidence between each word and the words in its left and right adjacent word sets, merges incorrectly split words into candidate new words, and splits conjunctions to prevent multiple words from being chained into phrases. Experiments on Weibo post data show that, compared with other confidence-based segmentation methods, the proposed algorithm greatly improves the accuracy of new word detection, especially for named entities and Internet slang, and shortens the detected words while keeping their semantics complete; even with a small test corpus, it can still recognize low-frequency new words.
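    The abstract describes the pipeline (Jieba pre-segmentation, confidence-based merging of adjacent words, conjunction splitting) but not the exact correlation-confidence formula. The following is a minimal, hypothetical Python sketch of such a pipeline, using a pointwise-mutual-information-style score as a stand-in for the paper's confidence measure; the function name find_candidate_new_words, the CONJUNCTIONS set, and the min_count/threshold parameters are illustrative assumptions, not taken from the paper.

# A minimal sketch, assuming a PMI-style score in place of the paper's
# correlation confidence. Requires the third-party package: pip install jieba
import math
from collections import Counter

import jieba


# Hypothetical list of connecting words that block merging; the paper's
# actual conjunction list is not given in the abstract.
CONJUNCTIONS = {"的", "了", "和", "与", "在", "是"}


def find_candidate_new_words(texts, min_count=2, threshold=1.0):
    """Merge adjacent tokens whose association score exceeds a threshold."""
    unigram, bigram = Counter(), Counter()
    for text in texts:
        tokens = jieba.lcut(text)               # preliminary Jieba segmentation
        unigram.update(tokens)
        bigram.update(zip(tokens, tokens[1:]))  # right-adjacent word pairs

    total = sum(unigram.values())
    candidates = {}
    for (w1, w2), n in bigram.items():
        if n < min_count:
            continue
        # Never merge across a connecting word: this mirrors the splitting
        # step that prevents several words from forming a phrase block.
        if w1 in CONJUNCTIONS or w2 in CONJUNCTIONS:
            continue
        # Assumed PMI-style association score for the adjacent pair.
        score = math.log((n * total) / (unigram[w1] * unigram[w2]))
        if score > threshold:
            candidates[w1 + w2] = score
    return candidates


if __name__ == "__main__":
    corpus = ["..."]  # e.g. a list of Weibo posts
    ranked = sorted(find_candidate_new_words(corpus).items(),
                    key=lambda kv: -kv[1])
    for word, score in ranked:
        print(word, round(score, 2))

    In the paper's method the confidence is computed between a word and each word in its left and right adjacent word sets; the bigram counter above approximates that by scoring every observed adjacent pair directly.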

Cite this article:

Cao Shuai (曹帅). New word detection algorithm combining correlation confidence and Jieba word segmentation. Computer Systems & Applications (计算机系统应用), 2020, 29(5): 144-151.

History
  • Received: 2019-10-09
  • Revised: 2019-11-04
  • Published online: 2020-05-07
  • Published: 2020-05-15