New Word Detection Algorithm Combining Correlation Confidence and Jieba Word Segmentation

    Abstract:

    Word segmentation is one of the most important steps in Chinese natural language processing. It underlies keyword extraction, automatic text summarization, and text clustering, and its quality directly affects the accuracy of downstream text processing. In recent years, with the rise of open public-opinion platforms such as microblogs, live-streaming platforms, and WeChat Moments, large volumes of informally written text, and in particular a steady stream of new words, have posed a major challenge to segmentation accuracy, so new word detection has become a problem that segmentation algorithms must solve. New word detection faces several difficulties: the overall volume of new-word data is small, new words are used flexibly, and excessive merging of words tends to produce phrase chunks. To address these difficulties, this study proposes a new word detection algorithm combining correlation confidence and Jieba word segmentation. Starting from the preliminary segmentation produced by Jieba, a Python library, the algorithm computes the correlation confidence between each word and every word in its left and right adjacent-word sets, merges incorrectly split words into candidate new words, and splits at connective words to prevent multiple words from being joined into phrases. Experiments on microblog post data show that, compared with other confidence-based segmentation methods, the proposed algorithm substantially improves the accuracy of new word detection, especially for named entities and Internet slang, and reduces the length of detected new words while keeping them semantically complete. Even with a small test corpus, the algorithm can still recognize low-frequency new words.
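The merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's confidence formula, so the code below assumes the usual association-rule definition, count(a, b) / count(a), taken in both directions. The function name `find_candidates`, the `CONNECTIVES` list, the thresholds, and the sample tokens are all hypothetical. The input is a list of already segmented sentences, which in practice could come from `jieba.lcut`.

```python
from collections import Counter

# Hypothetical list of connective/function words; merging is cut at these tokens.
CONNECTIVES = {"的", "和", "与", "了", "在", "是"}


def find_candidates(sentences, threshold=0.8, min_pair_count=2):
    """Merge adjacent tokens whose mutual confidence is high.

    sentences: list of token lists (e.g. produced by jieba.lcut).
    Assumed confidence: count(a, b) / count(a) and count(a, b) / count(b);
    a pair is linked when the smaller of the two reaches `threshold`.
    Returns a Counter mapping each merged candidate word to its frequency.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def linked(a, b):
        # Never merge across a connective, so phrases are not glued together.
        if a in CONNECTIVES or b in CONNECTIVES:
            return False
        n = bigrams[(a, b)]
        if n < min_pair_count:
            return False
        return min(n / unigrams[a], n / unigrams[b]) >= threshold

    candidates = Counter()
    for tokens in sentences:
        run = tokens[:1]  # current run of linked tokens
        for a, b in zip(tokens, tokens[1:]):
            if linked(a, b):
                run.append(b)
            else:
                if len(run) > 1:
                    candidates["".join(run)] += 1
                run = [b]
        if len(run) > 1:
            candidates["".join(run)] += 1
    return candidates


# Toy input: a slang word that a segmenter might split into two tokens.
sample = [
    ["今天", "奥利", "给"],
    ["奥利", "给", "的", "一天"],
    ["大家", "和", "奥利", "给"],
]
print(dict(find_candidates(sample)))  # {'奥利给': 3}
```

Requiring both directions of confidence to pass the threshold is one simple way to avoid attaching a rare word to a very common neighbor; the paper's actual criterion may differ.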

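Cite this article

Cao Shuai (曹帅). New Word Detection Algorithm Combining Correlation Confidence and Jieba Word Segmentation. Computer Systems & Applications (计算机系统应用), 2020, 29(5): 144-151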

History
  • Received: 2019-10-09
  • Last revised: 2019-11-04
  • Published online: 2020-05-07
  • Publication date: 2020-05-15