Abstract: With the rapid progress of satellite video imaging technology, object tracking in satellite videos has attracted growing research attention. However, most previous methods obtain spatial information through a global attention mechanism, which leads the model to focus on the background and overlook the object; moreover, they use only the spatial information of the object in the video frames, resulting in inaccurate object localization. In this study, we improve the existing Siamese network object tracking model SiamCAR and propose a spatio-temporal Siamese network, Siam-STM. Specifically, we propose a spatial information perception module based on the attention mechanism, which aggregates the contextual information in the images and enhances the discriminative capability of small-object features in satellite videos. To utilize the temporal information across video frames, we propose a temporal information perception module that fuses the current frame with historical frames, enabling the model to learn the object's position over time, track its trajectory more reliably, and mitigate interference from similar objects. In addition, to mitigate the effects of occlusion in satellite videos, we introduce a linear fitting method based on the Kalman filter and build a motion estimation mechanism on it. This mechanism can effectively model the motion characteristics of the object, allowing accurate localization even during occlusions. The effectiveness of Siam-STM is verified by comparison with state-of-the-art models on the SatSOT dataset.
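To illustrate the idea behind the motion estimation mechanism, the following is a minimal sketch of how a Kalman filter can carry an object's position through occluded frames. It is not the linear fitting method of this study: it assumes a plain constant-velocity model with one independent filter per image axis, and the names `AxisKalman` and `track_with_occlusion`, as well as the noise values `q` and `r`, are hypothetical choices made only for this example.

```python
from dataclasses import dataclass


@dataclass
class AxisKalman:
    """Constant-velocity Kalman filter for one image axis (illustrative only)."""
    pos: float
    vel: float = 0.0
    # Entries of the symmetric 2x2 covariance P = [[p00, p01], [p01, p11]].
    p00: float = 1.0
    p01: float = 0.0
    p11: float = 1.0
    q: float = 1e-2  # process noise variance (assumed value)
    r: float = 1.0   # measurement noise variance (assumed value)

    def predict(self) -> float:
        # State transition over one frame: position advances by velocity.
        self.pos += self.vel
        # P <- F P F^T + q*I, with F = [[1, 1], [0, 1]].
        p00 = self.p00 + 2.0 * self.p01 + self.p11 + self.q
        p01 = self.p01 + self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p11 = p00, p01, p11
        return self.pos

    def update(self, z: float) -> None:
        # Only the position is observed, so H = [1, 0].
        s = self.p00 + self.r             # innovation variance
        k0, k1 = self.p00 / s, self.p01 / s  # Kalman gain
        y = z - self.pos                  # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        # P <- (I - K H) P
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11


def track_with_occlusion(centers, x0, y0):
    """Filter a sequence of object centers; `None` marks an occluded frame."""
    fx, fy = AxisKalman(x0), AxisKalman(y0)
    estimates = []
    for c in centers:
        px, py = fx.predict(), fy.predict()
        if c is None:
            # Occluded: no measurement, so fall back on the motion model.
            estimates.append((px, py))
        else:
            fx.update(c[0])
            fy.update(c[1])
            estimates.append((fx.pos, fy.pos))
    return estimates


if __name__ == "__main__":
    # Object moving 2 px/frame in x and 1 px/frame in y, hidden for 3 frames.
    observed = [(10 + 2 * t, 20 + t) for t in range(1, 21)] + [None] * 3
    for est in track_with_occlusion(observed, 10.0, 20.0)[-3:]:
        print(est)
```

In this sketch the tracker's detected center would play the role of the measurement; whenever the detection is unavailable, the predicted position is used in its place, which is the general behavior the abstract describes for occluded frames.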