Computer Systems Applications, English version: 2018, 27(12): 216-221
Short Text Classification Based on Multi-Factors Affecting Features Selection
(Chinese title: 多因素影响特征选择的短文本分类方法)
(1. School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China; 2. Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China)
Received:May 04, 2018    Revised:May 24, 2018
Abstract: Feature selection is a process of dimensionality reduction and denoising. Whether a term has strong class-discriminating power is measured by the weight that a feature selection evaluation function assigns to it. Many factors affect feature selection, chiefly the dimensionality, importance, and semantics of features. Short texts carry little information, so their feature representations are high-dimensional and sparse, and traditional feature extraction methods lack semantics. To address both problems, a feature selection function, FS, that fuses these factors is constructed. Compared with the traditional feature selection function TF-IDF, FS not only incorporates the semantics of features but also removes a large number of redundant features and raises the weights of features that have class-discriminating power. With FS as the new feature selection function, short text classification experiments on the Chinese corpus of Sogou Lab verify the effectiveness of the method.
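The abstract does not state the formula for FS, so FS itself cannot be sketched here. For reference, the following is a minimal sketch of the TF-IDF baseline it is compared against, assuming the common tf × log(N/df) variant (the paper may use a different one). The function names `tf_idf` and `top_k_terms` and the toy documents are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes weight = (term count / document length) * log(N / df),
    one common TF-IDF variant; other variants smooth or normalize differently.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_k_terms(doc_weights, k):
    """Keep the k highest-weighted terms; ties are broken alphabetically."""
    return sorted(doc_weights, key=lambda t: (-doc_weights[t], t))[:k]


# Toy, pre-tokenized short texts (hypothetical data, not the Sogou corpus).
docs = [
    ["stock", "market", "rises"],
    ["team", "wins", "match"],
    ["stock", "prices", "fall"],
]
weights = tf_idf(docs)
selected = top_k_terms(weights[0], 2)
```

In this toy corpus "stock" appears in two of the three documents, so it receives a lower weight than terms unique to one document. The weighting uses only term counts, which is the lack of semantics that the abstract says FS is meant to address.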
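Funding: Shanxi Province and Chinese Academy of Sciences Science and Technology Cooperation Project (20141101001); Major Science and Technology Special Project of Shanxi Province under the 12th Five-Year Plan (20121101001); Shanxi Province Social Development Science and Technology Research Project (20140313020-1)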
Cite this article:
LI Wen-Hui (李文慧), ZHANG Ying-Jun (张英俊), PAN Li-Hu (潘理虎). Short Text Classification Based on Multi-Factors Affecting Features Selection. COMPUTER SYSTEMS APPLICATIONS, 2018, 27(12): 216-221