Abstract: In deep-learning-based monocular depth estimation, depth information is lost during the downsampling stages of convolutional neural networks, which leads to poor depth estimates at object edges. To address this problem, this study presents a multi-scale feature fusion method in which an adaptive fusion strategy dynamically adjusts the fusion ratios of feature maps at different scales according to the feature data, making full use of multi-scale feature information. When atrous spatial pyramid pooling (ASPP) is used for monocular depth estimation, the loss of pixel information degrades predictions for small objects; applying ASPP to deep feature maps and fusing in the rich feature information of shallow feature maps improves the depth estimation results. Experimental results on the NYU-Depth V2 indoor dataset show that the proposed method predicts object edges more accurately and significantly improves predictions for small objects. The root mean square error (RMSE) reaches 0.389 and the accuracy (δ < 1.25) reaches 0.897, verifying the effectiveness of the method.
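Since the abstract only outlines the adaptive fusion strategy, the following is a minimal PyTorch sketch of one way such a module could look, not the paper's actual implementation: each scale's feature map is projected to a common channel count, upsampled to the finest resolution, and combined with per-pixel fusion ratios predicted from the features themselves. The class name, gating scheme, and channel counts are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMultiScaleFusion(nn.Module):
    """Hypothetical sketch of adaptive multi-scale feature fusion.

    Fusion ratios are computed from the feature data via a softmax over
    the scale axis, so they adapt per pixel and sum to 1 at every location.
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # 1x1 convs project each scale to a shared channel dimension
        self.projs = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )
        # One weight logit per scale, predicted from the projected features
        self.gates = nn.ModuleList(
            [nn.Conv2d(out_channels, 1, kernel_size=1) for _ in in_channels]
        )

    def forward(self, feats):
        # feats: list of maps ordered from finest to coarsest resolution
        target = feats[0].shape[-2:]
        projected, logits = [], []
        for proj, gate, f in zip(self.projs, self.gates, feats):
            p = F.interpolate(proj(f), size=target, mode="bilinear",
                              align_corners=False)
            projected.append(p)
            logits.append(gate(p))
        # Softmax across scales -> data-dependent, per-pixel fusion ratios
        weights = torch.softmax(torch.stack(logits, dim=0), dim=0)
        return sum(w * p for w, p in zip(weights, projected))

# Example: fuse shallow (fine) and deep (coarse, e.g. post-ASPP) features
fusion = AdaptiveMultiScaleFusion(in_channels=[64, 128, 256], out_channels=64)
feats = [torch.randn(1, 64, 120, 160),
         torch.randn(1, 128, 60, 80),
         torch.randn(1, 256, 30, 40)]
out = fusion(feats)  # shape: (1, 64, 120, 160)
```

In this sketch the coarsest input would correspond to the deep, ASPP-processed feature map, and the learned weights let shallow features dominate near edges and small objects, which is the effect the abstract describes.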