Corpus Collection Based on Semantic Relevancy Focused Crawler

    Abstract:

    To address the problem of corpus collection, a corpus collection system based on a semantic-relevancy focused crawler is implemented. Word vectors trained on Wikipedia and HowNet are used to calculate the semantic relevancy between page information and descriptive information derived from the topical keywords, and URL structural information is used to calculate topical relevancy. Experimental results show that the system performs well on party-construction corpus collection, with a high average accuracy of 94.87%, while the average accuracy for web pages is 64.20%.
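
    The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the averaging of word vectors, the URL segment matching, the weight `alpha`, and all function names are assumptions made for illustration, and HowNet-based similarity is not modeled here.

```python
import math
from urllib.parse import urlparse


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is empty or zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(words, vectors):
    """Average the vectors of the words that appear in the (hypothetical) lookup table."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def text_relevancy(page_words, topic_words, vectors):
    """Semantic relevancy between page text and the topic description (assumed: mean + cosine)."""
    return cosine(mean_vector(page_words, vectors), mean_vector(topic_words, vectors))


def url_relevancy(url, topic_tokens):
    """Fraction of host/path segments containing a topic token (assumed heuristic)."""
    parsed = urlparse(url)
    raw = (parsed.netloc + parsed.path).lower().replace(".", "/")
    segments = [s for s in raw.split("/") if s]
    if not segments:
        return 0.0
    hits = sum(1 for s in segments if any(t in s for t in topic_tokens))
    return hits / len(segments)


def combined_score(url, page_words, topic_words, topic_tokens, vectors, alpha=0.7):
    """Weighted mix of text and URL relevancy; alpha is an illustrative assumption."""
    text_score = text_relevancy(page_words, topic_words, vectors)
    return alpha * text_score + (1 - alpha) * url_relevancy(url, topic_tokens)
```

    In a crawler, a score like this would typically be compared against a threshold to decide whether a page is kept and whether its outgoing links are queued; the threshold and weighting would need to be tuned on labeled pages.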

Get Citation

Zhou Kun, Wang Zhao, Yu Bihui. Corpus Collection Based on Semantic Relevancy Focused Crawler. Computer Systems & Applications, 2019, 28(5): 190-195

History
  • Received: November 26, 2018
  • Revised: December 18, 2018
  • Online: May 05, 2019
  • Published: May 15, 2019