Microblog New Word Recognition Combining Skip-Gram Model and Word Vector Projection
Author: Yu Jie (于洁)
    Abstract:

    With the popularity of microblogs and other social networks, new words emerge in a steady stream, and Chinese word segmentation systems often split these new words into individual characters. New word discovery has therefore become a hot topic in Chinese natural language processing. Existing new word recognition methods rely on statistics from large-scale corpora and perform poorly on low-frequency new words. This paper presents an extension of the skip-gram model together with a word vector projection method. Combining the two methods effectively eases the data sparseness problem in natural language processing, identifies low-frequency new words, and improves the precision and recall of Chinese word segmentation systems.
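
    As a rough illustration of why skip-grams can ease data sparseness, the sketch below enumerates k-skip-bigrams: ordered pairs of tokens that may have up to k other tokens between them. It assumes the k-skip-n-gram sense of "skip-gram"; it is not the paper's extended model, and the function name `k_skip_bigrams` is hypothetical. Allowing skips yields more co-occurrence pairs from the same text, which is the general intuition behind using them for low-frequency items.

```python
def k_skip_bigrams(tokens, k):
    """Return ordered pairs (tokens[i], tokens[j]) with at most k tokens skipped between them.

    Illustrative helper only (hypothetical name); k = 0 gives ordinary bigrams.
    """
    pairs = []
    for i in range(len(tokens)):
        # j may lie up to k + 1 positions to the right of i
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs


if __name__ == "__main__":
    tokens = list("abcd")  # stand-in for a character sequence from a microblog post
    for k in (0, 1, 2):
        print(k, len(k_skip_bigrams(tokens, k)))
```

    On a four-token sequence the number of pairs grows from 3 (plain bigrams) to 5 with one skip and 6 with two skips, so rare strings contribute more evidence per occurrence.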

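Citation

Yu Jie (于洁). Microblog New Word Recognition Combining Skip-Gram Model and Word Vector Projection. Computer Systems & Applications (计算机系统应用), 2016, 25(7): 130-136.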

History
  • Received: November 17, 2015
  • Revised: December 21, 2015
  • Online: July 21, 2016