Corpus Collection Based on Semantic Relevancy Focused Crawler

doi:10.15888/j.cnki.csa.006922

Volume 28, Issue 5, 2019, pp. 190-195


Author:
ZHOU Kun
University of Chinese Academy of Sciences, Beijing 100049, China; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China
WANG Zhao
Center for Information Technology, Shenyang State Tax Bureau, Shenyang 110013, China
YU Bi-Hui
University of Chinese Academy of Sciences, Beijing 100049, China; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China

                    

Abstract:

To address the problem of corpus collection, a corpus collection system based on a semantic relevancy focused crawler is implemented. Word vectors trained on Wikipedia and HowNet are used to calculate the semantic relevancy between page information and descriptive information derived from topical keywords, and URL structural information is used to calculate topical relevancy. Experimental results show that the system performs well on party-construction corpus collection, with an average accuracy of 94.87%, while the average accuracy rate for web pages is 64.20%.

Key words: corpus collection; semantic relevancy focused crawler; page information semantic relevancy; URL structural information
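
The abstract names two signals, page-level semantic relevancy from word vectors and topical relevancy from URL structure, but this page does not give the formulas. The following is only a minimal sketch of how such a combined score might look, under stated assumptions: the three-dimensional vectors in `TOY_VECTORS` are invented for illustration, HowNet is not modeled at all, the URL score is a simple keyword overlap, and the mixing weight `alpha` is hypothetical. None of these choices should be read as the authors' method.

```python
import math
import re
from urllib.parse import urlparse

# Toy 3-dimensional embeddings, invented for illustration only.
# A real system would load vectors trained on a large corpus.
TOY_VECTORS = {
    "party": (0.9, 0.1, 0.0),
    "construction": (0.8, 0.2, 0.1),
    "policy": (0.7, 0.3, 0.0),
    "football": (0.0, 0.1, 0.9),
    "match": (0.1, 0.0, 0.8),
}


def mean_vector(words):
    """Average the vectors of the known words; None if no word is known."""
    vecs = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def page_relevancy(page_words, topic_words):
    """Semantic relevancy: cosine between averaged page and topic vectors."""
    page_vec = mean_vector(page_words)
    topic_vec = mean_vector(topic_words)
    if page_vec is None or topic_vec is None:
        return 0.0
    return cosine(page_vec, topic_vec)


def url_relevancy(url, topic_words):
    """Assumed URL signal: fraction of topic words found among URL tokens."""
    if not topic_words:
        return 0.0
    parsed = urlparse(url)
    raw = (parsed.netloc + parsed.path).lower()
    tokens = {t for t in re.split(r"[^a-z0-9]+", raw) if t}
    hits = sum(1 for w in topic_words if w in tokens)
    return hits / len(topic_words)


def combined_score(page_words, url, topic_words, alpha=0.7):
    """Weighted mix of both signals; alpha is a hypothetical weight."""
    return (alpha * page_relevancy(page_words, topic_words)
            + (1 - alpha) * url_relevancy(url, topic_words))


if __name__ == "__main__":
    topic = ["party", "construction"]
    on_topic = combined_score(
        ["party", "policy"],
        "http://example.org/party/construction/news.html", topic)
    off_topic = combined_score(
        ["football", "match"],
        "http://example.org/sports/match.html", topic)
    print(on_topic > off_topic)
```

In a focused crawler, a score of this kind would typically be used to decide whether to keep a fetched page and how to prioritize its outgoing links; the threshold and weighting would need tuning against labeled pages.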

Citation:

ZHOU Kun, WANG Zhao, YU Bi-Hui. Corpus Collection Based on Semantic Relevancy Focused Crawler. Computer Systems & Applications (计算机系统应用), 2019, 28(5): 190-195


History

Received: November 26, 2018
Revised: December 18, 2018
Online: May 05, 2019
Published: May 15, 2019
