Abstract: Visual navigation relies on visual information from the environment, and object detection is one of its key tasks. Traditional object detection methods require large numbers of annotations and focus only on individual images, failing to exploit the data similarity inherent in visual navigation tasks. To address this problem, this paper proposes a self-supervised training task based on historical image information. In this method, images captured at the same location at multiple moments are aggregated. The foreground and background are then distinguished by information entropy, and the enhanced images are fed into the simple Siamese (SimSiam) self-supervised paradigm for training. In addition, the multi-layer perceptron (MLP) networks in the projection and prediction layers of the SimSiam paradigm are replaced with a convolutional attention module and a convolution module, respectively, and the loss function is redefined as a loss among multi-dimensional vectors, thereby extracting multi-dimensional features from the images. Finally, the self-supervised pre-trained model is used to train models for downstream tasks. Experiments show that the proposed method effectively improves precision on downstream classification and detection tasks on the processed nuScenes dataset: its Top-5 precision on downstream classification reaches 66.95%, and its mean average precision (mAP) on downstream detection reaches 40.02%.
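The abstract does not spell out how information entropy separates foreground from background. A minimal illustrative sketch of the general idea, assuming Shannon entropy is computed over pixel-intensity histograms of image patches and thresholded (the function names and the threshold value below are hypothetical, not from the paper), might look like:

```python
import math
from collections import Counter

def patch_entropy(patch):
    """Shannon entropy (in bits) of the pixel-intensity histogram of a patch."""
    counts = Counter(patch)
    n = len(patch)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_foreground(patches, threshold=2.0):
    """Label each patch: high entropy (rich texture) -> foreground,
    low entropy (flat region) -> background. Threshold is illustrative."""
    return ["foreground" if patch_entropy(p) > threshold else "background"
            for p in patches]

# A flat, repetitive patch has low entropy; a varied patch has high entropy.
background_patch = [10] * 16          # one intensity value: entropy 0.0
foreground_patch = list(range(16))    # 16 distinct values: entropy 4.0
print(split_foreground([background_patch, foreground_patch]))
# → ['background', 'foreground']
```

The intuition is that textured foreground objects produce more varied intensity distributions (higher entropy) than flat background regions, so entropy can drive which regions the augmentation pipeline emphasizes before SimSiam training.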