Abstract: The main challenge of fine-grained image classification lies in high inter-class similarity and large intra-class variation. Most existing methods rely on deep features and ignore shallow details, yet deep semantic features lose much fine detail through repeated convolution and pooling operations. To better integrate shallow and deep information, this study proposes a fine-grained image classification method based on cross-layer collaborative attention and channel grouping attention. First, a pre-trained ResNet50 serves as the backbone network for feature extraction, and the features from its last three stages are output as three branches. The features of each branch then interact with those of the other two branches through cross-layer collaborative attention and are fused. In addition, the features of the last stage pass through the channel grouping attention module to strengthen the learning of semantic features. The model can be trained efficiently in an end-to-end manner without bounding boxes or additional annotations. Experimental results show that the method performs well on three common fine-grained image datasets, CUB-200-2011, Stanford Cars, and FGVC-Aircraft, reaching accuracies of 89.5%, 94.8%, and 94.7%, respectively.
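The abstract does not describe the internals of the channel grouping attention module, so the following is only an illustrative sketch of the general idea of grouping channels and reweighting them, not the authors' implementation. It assumes a feature map stored as a list of channels (each a flattened list of activations), a hypothetical function name `channel_group_attention`, and a simple design in which each group's weight is a softmax over the groups' mean activations.

```python
import math


def channel_group_attention(features, num_groups):
    """Reweight channels by a softmax over per-group mean activations.

    features: list of channels, each a non-empty list of floats (flattened H*W).
    Returns the reweighted channels and the per-group weights.
    Hypothetical sketch; the grouping and weighting scheme are assumptions.
    """
    num_channels = len(features)
    if num_groups <= 0 or num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    group_size = num_channels // num_groups

    # Global average pooling per group: one scalar descriptor per group.
    descriptors = []
    for g in range(num_groups):
        group = features[g * group_size:(g + 1) * group_size]
        total = sum(sum(channel) for channel in group)
        count = sum(len(channel) for channel in group)
        descriptors.append(total / count)

    # Numerically stable softmax over the group descriptors.
    peak = max(descriptors)
    exps = [math.exp(d - peak) for d in descriptors]
    norm = sum(exps)
    weights = [e / norm for e in exps]

    # Scale every channel by the weight of the group it belongs to.
    reweighted = [
        [value * weights[c // group_size] for value in channel]
        for c, channel in enumerate(features)
    ]
    return reweighted, weights


# Toy input: 4 channels with 2 spatial positions each, split into 2 groups.
toy_features = [[1.0, 1.0], [1.0, 1.0], [3.0, 3.0], [3.0, 3.0]]
toy_output, toy_weights = channel_group_attention(toy_features, num_groups=2)
```

In this toy input the second group has the larger mean activation, so it receives the larger weight. A practical module would more likely learn the group weights (for example with small fully connected layers) instead of using raw means.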