Abstract: In recent years, web content detection has focused mainly on extracting features from HTML documents through semantic analysis or emulated execution. This is undesirable because it significantly complicates implementation, incurs high computational overhead, and opens up an attack surface within the detector. We propose a deep learning approach to detecting malicious web pages. First, we use a simple regular expression to extract tokens from the static HTML document; a neural network model then captures locality representations at multiple hierarchical spatial scales over the document, which allows the model to quickly find small fragments of malicious code in web pages of any length. Experimental results show that the approach achieves a detection rate of 96.4% at a false positive rate of 0.1%, a substantially higher classification accuracy than the baseline and simplified models. The speed and accuracy of the proposed approach make it suitable for deployment on endpoints, firewalls, and web proxies.
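
The two steps named in the abstract, regular-expression tokenization followed by aggregation at several hierarchical scales, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the token pattern, the hashed bag-of-tokens representation, the bucket count, the choice of scales (1, 2, and 4 chunks), and all function names. The neural network that would consume these features is omitted.

```python
import re
import zlib

# Assumed token pattern (hypothetical): runs of letters, digits, and underscores.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")


def tokenize(html):
    """Extract tokens from raw HTML text without parsing or executing it."""
    return TOKEN_RE.findall(html)


def hashed_counts(tokens, n_buckets=16):
    """Map tokens to a fixed-length count vector with a deterministic hash."""
    vec = [0] * n_buckets
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
    return vec


def split_chunks(tokens, n_chunks):
    """Split the token sequence into n_chunks contiguous, near-equal parts."""
    size, rem = divmod(len(tokens), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rem else 0)
        chunks.append(tokens[start:end])
        start = end
    return chunks


def multiscale_features(html, scales=(1, 2, 4), n_buckets=16):
    """Return one hashed count vector per chunk, for each assumed scale.

    Scale 1 summarizes the whole document; larger scales summarize
    progressively smaller regions, so a short localized fragment is not
    diluted by the rest of a long page.
    """
    tokens = tokenize(html)
    return {
        s: [hashed_counts(c, n_buckets) for c in split_chunks(tokens, s)]
        for s in scales
    }


if __name__ == "__main__":
    page = '<a href="x.js">hi</a>'
    print(tokenize(page))
    features = multiscale_features(page)
    for scale, vectors in features.items():
        print(scale, [sum(v) for v in vectors])
```

Because the feature size depends only on the number of scales and buckets, not on the page length, a downstream classifier in this sketch would see a fixed-size input for documents of any length.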