Computer Systems Applications (English edition): 2010, 19(2): 106-109
An Enhanced Odds Ratio Dualistic Feature Extraction Method
(School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China)
Received: May 18, 2009
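Abstract (translated from the Chinese): A key problem in topical web crawler research is text feature extraction, whose quality directly affects the extraction of topic features and the computation of page relevance. Building on a study of feature extraction methods for text classification, this paper analyzes the strengths and weaknesses of the Odds Ratio feature extraction method, takes term frequency and dispersion into account as decision factors, and proposes an improved binary-classification feature selection method, EOR. The resulting EOR value is combined with term frequency (TF), giving TF-EOR, to compute the weights of a document's feature words, and the scheme is applied to a topical web crawler. Simulation experiments show that EOR raises document classification accuracy by up to 5% at low and medium feature dimensions, and that the TF-EOR weighting outperforms TF-IDF, improving the crawler's precision and recall by up to 4% in the experiments.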
Abstract: An important issue in topical crawler research is feature extraction, which strongly affects topic description and page relevance scoring. The existing Odds Ratio method performs well on high-dimensional feature vectors but poorly in low-dimensional settings. An enhanced method, EOR, is proposed; it is based on the Odds Ratio method and takes word frequency and distribution rate into account. Simulation shows a 5% increase in text categorization precision at low and middle feature dimensions. Furthermore, combining the EOR score with the TF value (TF-EOR) to calculate word weights and applying the result to a topical crawler yields a 4% increase in both precision and recall.
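To make the baseline concrete, the following is a minimal sketch of the standard Odds Ratio score for two-class feature selection, together with a TF-weighted term vector built from those scores. It is not the paper's EOR: the abstract does not give the formula for the frequency and dispersion factors, so they are left out here. The function names `odds_ratio_scores` and `tf_weighted`, the additive smoothing constant, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def odds_ratio_scores(pos_docs, neg_docs, smoothing=0.5):
    """Standard log odds ratio per term for a two-class corpus.

    pos_docs / neg_docs are lists of token lists. The smoothing term is an
    assumption of this sketch; it keeps every probability strictly inside
    (0, 1) so the logarithm is always defined.
    """
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    # Document frequency: count each term at most once per document.
    df_pos = Counter(t for doc in pos_docs for t in set(doc))
    df_neg = Counter(t for doc in neg_docs for t in set(doc))
    scores = {}
    for term in set(df_pos) | set(df_neg):
        p_pos = (df_pos[term] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (df_neg[term] + smoothing) / (n_neg + 2 * smoothing)
        scores[term] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def tf_weighted(doc, scores):
    """Weight each term of one document by relative TF times its score."""
    tf = Counter(doc)
    total = len(doc)
    return {t: (c / total) * scores.get(t, 0.0) for t, c in tf.items()}


# Hypothetical toy corpus: on-topic pages versus off-topic pages.
pos_docs = [["crawler", "topic"], ["crawler", "web"]]
neg_docs = [["sports", "web"], ["sports", "news"]]
scores = odds_ratio_scores(pos_docs, neg_docs)
weights = tf_weighted(["crawler", "crawler", "web"], scores)
```

Terms concentrated in the positive class receive positive scores and terms concentrated in the negative class receive negative ones, so a term split evenly between the classes contributes nothing to the weighted vector. A real system would typically keep only the top-scoring terms as features before weighting.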
Citation:
DU Yi-Ping (杜一平), LIU Yan-Jun (刘燕君). An Enhanced Odds Ratio Dualistic Feature Extraction Method. Computer Systems Applications, 2010, 19(2): 106-109