Abstract: To take advantage of the rich literature resources available on the Web, this paper presents WLES, a specialized Web literature collection system. WLES integrates Web crawling and Web page cleaning. Machine learning is applied to the cleaning task: a cleaning model is learned from training data and then used to clean Web pages. Experiments show that WLES performs well in both Web crawling and Web page cleaning and meets users' literature collection needs.