Abstract: Ground images captured by unmanned aerial vehicle (UAV) platforms have high spatial resolution; while this provides rich detail, it also introduces considerable "interference" into crop classification. In particular, when deep models are used for crop recognition, problems such as insufficient extraction of edge information and misclassification of similarly textured crops lead to poor classification results. Therefore, a model based on multi-scale attention feature extraction is constructed to effectively extract edge information and improve the accuracy of crop classification. The proposed multi-scale attention network (MSAT) obtains crop information at different scales within the same level through multi-scale block embedding. The multi-scale feature maps are mapped into multiple sequences that are fed independently into the factor attention module, which enhances attention to crop contexts and improves the model's ability to extract plot-edge information. Moreover, the convolutional relative position encoding built into the factor attention module strengthens the modeling of local information within the module and the ability to distinguish similarly textured crops. Finally, coarse- and fine-grained information is extracted through the fusion of local and global features. Classification results on rice, sugarcane, corn, banana, and orange show that the MSAT model reaches a mean intersection over union (MIoU) of 0.816 and an overall accuracy (OA) of 98.10%, verifying that fine crop classification based on high-resolution images is feasible at low equipment cost.
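The multi-scale stage described above (embedding the same feature level at several patch sizes, attending over each sequence independently, then fusing) can be illustrated with a minimal NumPy sketch. This is a simplified, hypothetical rendering, not the paper's implementation: the function names (`patch_embed`, `msat_sketch`), the patch sizes, and the use of plain single-head softmax attention in place of the factor attention module with convolutional relative position encoding are all assumptions made for clarity.

```python
import numpy as np

def patch_embed(img, patch, dim, rng):
    # Split an (H, W, C) image into non-overlapping `patch` x `patch`
    # blocks and linearly project each flattened block to `dim` channels,
    # yielding one token sequence per scale.
    H, W, C = img.shape
    ph, pw = H // patch, W // patch
    patches = (img[:ph * patch, :pw * patch]
               .reshape(ph, patch, pw, patch, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(ph * pw, patch * patch * C))
    proj = rng.standard_normal((patch * patch * C, dim)) / np.sqrt(patch * patch * C)
    return patches @ proj  # (num_patches, dim)

def self_attention(x):
    # Plain single-head softmax attention over a token sequence (N, D);
    # stands in for the paper's factor attention module.
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def msat_sketch(img, patch_sizes=(4, 8), dim=16, seed=0):
    # One multi-scale stage: embed the image at several patch sizes,
    # run attention over each scale's sequence independently, then fuse
    # the pooled (global) descriptor of every scale by averaging.
    rng = np.random.default_rng(seed)
    pooled = [self_attention(patch_embed(img, p, dim, rng)).mean(axis=0)
              for p in patch_sizes]
    return np.mean(pooled, axis=0)  # fused multi-scale descriptor, (dim,)
```

For example, `msat_sketch(np.random.default_rng(1).random((32, 32, 3)))` returns a 16-dimensional fused descriptor; in the full model each scale's attended sequence would instead be kept as a feature map and combined with local convolutional features for dense classification.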