New Word Detection Algorithm Combining Correlation Confidence and Jieba Word Segmentation
    Abstract:

    Word segmentation is one of the most important steps in Chinese natural language processing. It is the basis for keyword extraction, automatic text summarization, and text clustering, and its quality directly affects the accuracy of downstream text processing. In recent years, with the rise of free public opinion platforms such as Microblog, live broadcast platforms, and WeChat Moments, a large number of new words have emerged and pose great challenges to word segmentation methods. New word discovery faces several problems: the overall number of new words is small, their usage is flexible, and excessive merging of words produces phrase blocks. To address these problems, this study proposes a new word detection algorithm that combines correlation confidence with Jieba word segmentation. Starting from the preliminary segmentation produced by the Jieba library in Python, the algorithm calculates the correlation confidence between adjacent words to merge incorrectly split words into candidate new words, and splits at conjunctions to prevent multiple words from being joined into phrases. Compared with other confidence-based word segmentation methods, the proposed algorithm greatly improves the accuracy of new word discovery, especially for named entities and Internet terms, and reduces the length of new words while preserving their integrity. Even with a small test corpus, the proposed algorithm can still recognize low-frequency new words.
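
The merge-and-split idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula for correlation confidence, so the definition used here (co-occurrence count of an adjacent pair divided by the larger of the two unigram counts) is an assumption, as are the function names, the `threshold` and `min_count` parameters, and the small `CONJUNCTIONS` stop list. The input is a list of token lists, such as `jieba.lcut` would return for each sentence; Jieba itself is not called, so the block runs without the library.

```python
from collections import Counter

# Hypothetical stop list: merging is never allowed across these tokens.
CONJUNCTIONS = {"和", "与", "及", "或"}


def adjacent_confidence(sentences, min_count=2):
    """Score adjacent token pairs.

    Assumed definition (not taken from the paper):
        conf(a, b) = count(a followed by b) / max(count(a), count(b))
    Pairs seen fewer than min_count times are dropped.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {
        (a, b): n / max(unigrams[a], unigrams[b])
        for (a, b), n in bigrams.items()
        if n >= min_count
    }


def find_new_words(sentences, threshold=0.8, min_count=2, stop=CONJUNCTIONS):
    """Merge runs of strongly linked adjacent tokens into candidate words."""
    conf = adjacent_confidence(sentences, min_count)
    candidates = Counter()
    for tokens in sentences:
        run = []  # tokens currently being merged
        for tok in tokens:
            linked = bool(
                run
                and tok not in stop
                and conf.get((run[-1], tok), 0.0) >= threshold
            )
            if linked:
                run.append(tok)
                continue
            # The run ends here: keep it only if it merged two or more tokens.
            if len(run) > 1:
                candidates["".join(run)] += 1
            # A conjunction breaks the run and is not carried forward.
            run = [] if tok in stop else [tok]
        if len(run) > 1:
            candidates["".join(run)] += 1
    return candidates


# Token lists as a segmenter might produce them, with new words split apart.
corpus = [
    ["我", "喜欢", "奥利", "给"],
    ["奥利", "给", "和", "打", "call"],
    ["大家", "一起", "打", "call"],
]
new_words = find_new_words(corpus)
```

On this toy corpus the pairs `奥利`/`给` and `打`/`call` are the only ones that occur twice and always together, so they are merged, while the conjunction `和` keeps the two candidates from being joined into a single phrase block. The frequency floor `min_count` is a simplification that would work against the low-frequency recognition the abstract reports, so it should be read as a convenience of this sketch rather than part of the published method.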

Get Citation

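Cao Shuai. New Word Detection Algorithm Combining Correlation Confidence and Jieba Word Segmentation. Computer Systems & Applications, 2020, 29(5): 144-151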

History
  • Received: October 09, 2019
  • Revised: November 04, 2019
  • Online: May 07, 2020
  • Published: May 15, 2020