A Description Method of Text Feature Based on Word Clustering
    Abstract:

    Text mining suffers from a high-dimensional feature space. This paper presents a new method for describing text features based on word clustering. The aim is to mine semantic associations between words using machine learning, then to dynamically construct a concept dictionary for a specific domain, and finally to describe text features with the constructed concepts. The method first analyzes word co-occurrence in the training corpus, without relying on a thematic dictionary. It then applies word clustering to generate word clusters, each represented by a seed word that stands for one thematic concept. Finally, it takes the seed words as the text features. Experimental results indicate that the method not only reduces the dimensionality of the feature space but also overcomes the limitations of the concepts in HowNet, and improves text categorization performance.
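
The three steps named in the abstract (co-occurrence analysis, word clustering around seed words, seed words as features) can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes document-level co-occurrence, Jaccard similarity between the sets of documents in which two words occur, and a greedy single-pass clustering in which frequent words become seeds. The function names `doc_sets`, `cluster_words`, and `describe`, and the `threshold` parameter, are hypothetical.

```python
from collections import defaultdict


def doc_sets(docs):
    """Map each word to the set of document indices in which it occurs."""
    occ = defaultdict(set)
    for i, tokens in enumerate(docs):
        for w in tokens:
            occ[w].add(i)
    return occ


def jaccard(a, b):
    """Co-occurrence similarity of two words, given their document sets."""
    return len(a & b) / len(a | b)


def cluster_words(docs, threshold=0.5):
    """Greedy clustering (an assumption, not the paper's procedure).

    Words are visited from most to least frequent. A word joins the
    most similar existing seed if the similarity reaches `threshold`;
    otherwise it becomes a new seed.
    """
    occ = doc_sets(docs)
    # Ties in document frequency are broken alphabetically.
    order = sorted(occ, key=lambda w: (-len(occ[w]), w))
    clusters = {}  # seed word -> list of member words (seed included)
    for w in order:
        best_seed, best_sim = None, threshold
        for seed in clusters:
            sim = jaccard(occ[w], occ[seed])
            if sim >= best_sim:
                best_seed, best_sim = seed, sim
        if best_seed is None:
            clusters[w] = [w]
        else:
            clusters[best_seed].append(w)
    return clusters


def describe(tokens, clusters):
    """Describe a text by how often each seed's cluster occurs in it."""
    word_to_seed = {w: s for s, members in clusters.items() for w in members}
    features = dict.fromkeys(clusters, 0)
    for w in tokens:
        seed = word_to_seed.get(w)
        if seed is not None:
            features[seed] += 1
    return features


# Toy training corpus (invented for illustration only).
corpus = [
    ["goal", "match", "team"],
    ["goal", "team", "coach"],
    ["stock", "market", "price"],
    ["stock", "price", "bank"],
]
clusters = cluster_words(corpus)
features = describe(["team", "coach", "goal", "price"], clusters)
```

On this toy corpus the eight distinct words collapse into two clusters seeded by `goal` and `price`, so a text is described by two features instead of eight. A real system would need word segmentation for Chinese text and a similarity measure and threshold tuned on the training corpus.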

Get Citation

Chen Jiong, Zhang Yongkui. A Description Method of Text Feature Based on Word Clustering. Computer Systems & Applications, 2011, 20(2): 211-215

History
  • Received: June 18, 2010
  • Revised: August 3, 2010