Abstract: Building on the traditional search model and the concept of ontology, this paper proposes a topical web crawling model based on an ontology semantic tree. Unlike traditional keyword-based topic descriptions, the model describes a topic with a semantic concept tree, which makes the semantic relationships between concepts simple to express. On this basis, the paper presents a method for calculating the relevance of an HTML page to the topic. When analyzing the relevance of a URL, the model considers not only the relevance of the link's anchor text to the topic but also evaluates the link itself with an improved PageRank algorithm. Only when this relevance does not reach a given threshold does the crawler download the page corresponding to the URL. This calculation method can greatly reduce unnecessary computational overhead and makes full use of anchor text and link importance information. Finally, for a web page whose relation to the topic is still uncertain, the model calculates the relevance of the page itself and thereby determines whether the page should be collected.
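The decision flow summarized above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the depth-decay concept weights, the token-overlap relevance measure, the linear combination with `alpha`, both threshold values, and all names (`Concept`, `concept_weights`, `text_relevance`, `link_relevance`, `should_collect`) are assumptions made for illustration, and the link importance is taken as a precomputed number in [0, 1] rather than produced by the improved PageRank algorithm.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in a hypothetical topic concept tree (single-word terms only)."""
    term: str
    children: list[Concept] = field(default_factory=list)


def concept_weights(root: Concept, decay: float = 0.5) -> dict[str, float]:
    """Weight each concept by depth: root = 1.0, each level multiplied by `decay`.

    The decay scheme is an assumption, not taken from the paper.
    """
    weights: dict[str, float] = {}
    stack = [(root, 1.0)]
    while stack:
        node, weight = stack.pop()
        weights[node.term.lower()] = weight
        for child in node.children:
            stack.append((child, weight * decay))
    return weights


def text_relevance(text: str, weights: dict[str, float]) -> float:
    """Average concept weight per token; tokens outside the tree count as 0."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return sum(weights.get(token, 0.0) for token in tokens) / len(tokens)


def link_relevance(anchor_text: str, link_importance: float,
                   weights: dict[str, float], alpha: float = 0.6) -> float:
    """Blend anchor-text relevance with a precomputed link importance score."""
    return alpha * text_relevance(anchor_text, weights) + (1 - alpha) * link_importance


def should_collect(anchor_text: str, link_importance: float, fetch_page,
                   weights: dict[str, float],
                   link_threshold: float = 0.4,
                   page_threshold: float = 0.1) -> bool:
    """Accept on link evidence alone when possible; otherwise download and check.

    `fetch_page` is a zero-argument callable returning the page text, so the
    download only happens when the link score is below `link_threshold`.
    """
    if link_relevance(anchor_text, link_importance, weights) >= link_threshold:
        return True
    return text_relevance(fetch_page(), weights) >= page_threshold


# Usage with a toy concept tree and stubbed page downloads.
topic = Concept("crawler", [Concept("ontology", [Concept("semantic")]),
                            Concept("pagerank")])
weights = concept_weights(topic)
print(should_collect("ontology crawler", 0.5, lambda: "", weights))
print(should_collect("click here", 0.1,
                     lambda: "semantic crawler design notes", weights))
```

Passing the download as a callable keeps the costly step lazy, which mirrors the abstract's point that a page is fetched only when anchor text and link importance are not enough to decide.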