Abstract: With the development of artificial intelligence, scene recognition, one of the important directions of computer vision research, has attracted increasing attention from researchers. Traditional handcrafted features cannot sufficiently describe the characteristics of scene images, which leads to unsatisfactory performance. In contrast, features extracted by Convolutional Neural Networks (CNNs) contain rich semantic and structural information about scene images. AlexNet, one of the most common CNN architectures, is chosen as the base model in this study. By improving the network in four aspects (depth, width, multi-scale extraction, and multilayer fusion), the proposed approach achieves accuracies of 92.0% and 94.5% on two publicly available datasets, respectively, demonstrating its superiority over other methods.