Abstract:After crawling to obtain a data set of large web pages on a large commodity site, the data set is screened to further get the target data set. Before screening, preparation must to be done is to delete the redundant tags in the web pages. Therefore, the algorithm of deletion tag is given with the idea of a recursive algorithm. The design idea of tag deletion function is put forward. 2 time design improvements are carried out to optimize the performance. Finally, the design idea of dual thread is adopted. The dual threads are 1 maintain buffer thread and 1 tag deletion thread. In single computer environment, experiments show that the optimized tag deletion function only takes 19.7 seconds for each 1000 pages, and only 1.1 hours for 200 000 web pages.