Abstract: Learning-based multi-view stereo matching algorithms have achieved remarkable results, but they still suffer from a limited convolutional receptive field and neglect of image frequency information, which leads to insufficient matching performance on low-texture, repetitive, and non-Lambertian surfaces. To address these problems, this study proposes CAF-MVSNet, a context-enhanced and image-frequency-guided multi-view stereo matching network. First, a context enhancement module is fused into the feature pyramid network at the feature extraction stage to effectively expand the network's receptive field. Then, an image-frequency-guided attention module is introduced to capture line, shape, texture, and color information by encoding different image frequencies, which strengthens long-range contextual connections and enables reliable feature matching on low-texture, repetitive, and non-Lambertian surfaces. Experimental results on the DTU dataset show that CAF-MVSNet improves the overall error by 12.3% compared with the classical cascade model CasMVSNet, demonstrating excellent performance. In addition, good results are achieved on the Tanks and Temples dataset, reflecting the strong generalization ability of CAF-MVSNet.
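To make the two ideas summarized above concrete, the following PyTorch snippet is a minimal illustrative sketch, not the authors' implementation: a context-enhancement block built from parallel dilated convolutions to widen the receptive field, and a frequency-guided attention block that re-weights feature channels from their amplitude spectra. All module names, channel sizes, and the SE-style frequency gating are assumptions made here for illustration only.

```python
# Minimal sketch (assumed design, not the paper's code) of the two ideas in the abstract.
import torch
import torch.nn as nn
import torch.fft


class ContextEnhancement(nn.Module):
    """Parallel dilated convolutions fused residually to enlarge the receptive field."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        ctx = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(ctx)  # residual fusion keeps the original features


class FrequencyGuidedAttention(nn.Module):
    """Channel attention driven by each channel's frequency content (assumed SE-style gating)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Amplitude spectrum summarizes each channel's frequency content:
        # edges/texture are high-frequency, smooth color regions are low-frequency.
        amp = torch.fft.rfft2(x, norm="ortho").abs()      # (B, C, H, W//2+1)
        desc = amp.mean(dim=(-2, -1))                      # (B, C) frequency descriptor
        weight = self.mlp(desc).unsqueeze(-1).unsqueeze(-1)
        return x * weight


if __name__ == "__main__":
    feat = torch.randn(1, 32, 64, 80)          # one pyramid-level feature map
    feat = ContextEnhancement(32)(feat)
    feat = FrequencyGuidedAttention(32)(feat)
    print(feat.shape)                           # torch.Size([1, 32, 64, 80])
```

In a cascade MVS pipeline such as CasMVSNet, blocks like these would typically be applied to each level of the feature pyramid before cost-volume construction; the exact placement and fusion strategy in CAF-MVSNet are described in the paper itself.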