Received: February 23, 2017  Revised: March 9, 2017
Abstract: Feature extraction and document vector representation are the two key steps in document classification. This paper proposes a word2vec-based document classification method that addresses both. The method builds a bag of feature words by document frequency (DF) so as to retain as many of the important feature words in the corpus as possible. It then exploits the latent semantic relationships captured by word2vec: semantically related feature words are replaced by a single topic word multiplied by a suitable coefficient, which condenses the bag of feature words and lowers the dimension of the document vectors. The method also applies TF-IDF weighting so that each feature word receives a more appropriate weight. In comparative experiments against two other document classification methods, the proposed method achieved better classification results than both.
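The pipeline summarized in the abstract (DF selection, merging related words into a topic word scaled by a coefficient, TF-IDF weighting) can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything specific here is an assumption: the greedy grouping rule, the use of cosine similarity as the coefficient, the `threshold` and `min_df` values, the smoothed IDF formula, and the hand-written 4-dimensional vectors that stand in for trained word2vec embeddings.

```python
import math
from collections import Counter

def select_by_df(docs, min_df):
    """Return {word: document frequency} for words appearing in >= min_df documents."""
    df = Counter(w for doc in docs for w in set(doc))
    return {w: n for w, n in df.items() if n >= min_df}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def map_to_topics(df, vectors, threshold):
    """Assumed greedy grouping: visit words by descending DF (ties alphabetical);
    a word joins the first topic word whose similarity is >= threshold,
    otherwise it becomes a new topic word. Returns (topics, word -> (topic, coef))."""
    topics, mapping = [], {}
    for w in sorted(df, key=lambda w: (-df[w], w)):
        for t in topics:
            sim = cosine(vectors[w], vectors[t])
            if sim >= threshold:
                mapping[w] = (t, sim)  # coefficient = similarity (assumption)
                break
        else:
            topics.append(w)
            mapping[w] = (w, 1.0)
    return topics, mapping

def doc_vector(doc, n_docs, df, topics, mapping):
    """TF-IDF weight of each feature word, scaled by its coefficient and
    accumulated on the dimension of its topic word."""
    tf = Counter(w for w in doc if w in mapping)
    total = sum(tf.values())
    vec = dict.fromkeys(topics, 0.0)
    for w, n in tf.items():
        topic, coef = mapping[w]
        idf = math.log(n_docs / df[w]) + 1.0  # smoothed IDF (assumption)
        vec[topic] += coef * (n / total) * idf
    return [vec[t] for t in topics]

# Toy corpus and toy embeddings (hypothetical; real use would load word2vec vectors).
docs = [
    ["car", "engine", "road"],
    ["automobile", "engine", "fuel"],
    ["car", "automobile", "road"],
    ["bank", "loan", "money"],
    ["money", "bank", "credit"],
]
vectors = {
    "car": (1.0, 0.1, 0.0, 0.0),
    "automobile": (0.9, 0.2, 0.0, 0.0),
    "engine": (0.3, 1.0, 0.0, 0.0),
    "road": (0.0, 0.0, 1.0, 0.0),
    "bank": (0.0, 0.0, 0.0, 1.0),
    "money": (0.0, 0.0, 0.1, 0.9),
}

df = select_by_df(docs, min_df=2)
topics, mapping = map_to_topics(df, vectors, threshold=0.9)
doc_vectors = [doc_vector(d, len(docs), df, topics, mapping) for d in docs]
```

In this toy run six feature words survive the DF filter and are condensed to four topic dimensions, because "car" folds into "automobile" and "money" folds into "bank". The resulting `doc_vectors` could then be passed to any standard classifier.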
Cite this article:
陈杰,陈彩,梁毅.基于Word2vec的文档分类方法.计算机系统应用,2017,26(11):159-164
CHEN Jie, CHEN Cai, LIANG Yi. Document Classification Method Based on Word2vec. COMPUTER SYSTEMS APPLICATIONS, 2017, 26(11): 159-164