Abstract: Attention mechanisms have recently been widely used in computer vision, for example within the common encoder/decoder framework for image captioning. However, existing decoding frameworks do not explicitly model the correlation between image features and the hidden states of the Long Short-Term Memory (LSTM) network, which leads to cumulative errors. In this study, we propose a Similar Temporal Attention Network (STAN) that extends the conventional attention mechanism to strengthen the correlation between attention results and hidden states at different moments. STAN first applies attention to the hidden state and the feature vector at the current moment, and then feeds the attention results of two adjacent LSTM time steps into the recurrent LSTM network at the next moment through an Attention Fusion Slot (AFS), thereby enhancing the correlation between attention results and hidden states. We also design a Hidden State Switch (HSS) to guide word generation, which is combined with the AFS to reduce cumulative errors. Extensive experiments on the public benchmark dataset Microsoft COCO under various evaluation metrics show that our algorithm outperforms the baseline model and achieves more competitive attention results.
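To make the described decoding step concrete, the following is a minimal PyTorch-style sketch of one STAN decoding step under our own assumptions: the abstract does not specify the attention form or the fusion and switching rules, so the additive attention, the gated blend inside `AttentionFusionSlot`, and the sigmoid gate inside `HiddenStateSwitch` are illustrative guesses, and all class and parameter names are hypothetical rather than the paper's.

```python
# Illustrative sketch of one STAN-style decoding step (all module names and
# fusion/gating rules are assumptions, not the paper's exact formulation).
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive attention: scores each image region against the hidden state."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):  # feats: (B, R, F), h: (B, H)
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hid_proj(h).unsqueeze(1)))  # (B, R, 1)
        alpha = torch.softmax(e, dim=1)      # weights over R regions
        return (alpha * feats).sum(dim=1)    # context vector: (B, F)

class AttentionFusionSlot(nn.Module):
    """Fuses the attention results of two adjacent time steps before they
    enter the LSTM at the next moment (assumed: a learned sigmoid blend)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.gate = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, c_prev, c_cur):
        g = torch.sigmoid(self.gate(torch.cat([c_prev, c_cur], dim=-1)))
        return g * c_cur + (1 - g) * c_prev

class HiddenStateSwitch(nn.Module):
    """Switches word prediction between the raw hidden state and an
    attention-informed state (assumed: a scalar sigmoid gate)."""
    def __init__(self, hidden_dim, feat_dim):
        super().__init__()
        self.to_hidden = nn.Linear(feat_dim, hidden_dim)
        self.switch = nn.Linear(hidden_dim + feat_dim, 1)

    def forward(self, h, c_fused):
        s = torch.sigmoid(self.switch(torch.cat([h, c_fused], dim=-1)))
        return s * h + (1 - s) * self.to_hidden(c_fused)

class STANDecoderStep(nn.Module):
    """One decoding step: attend, fuse adjacent contexts (AFS), update the
    LSTM, and gate the output state (HSS) to predict the next word."""
    def __init__(self, vocab_size, embed_dim, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = SoftAttention(feat_dim, hidden_dim, attn_dim)
        self.afs = AttentionFusionSlot(feat_dim)
        self.hss = HiddenStateSwitch(hidden_dim, feat_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word, feats, h, c, ctx_prev):
        ctx = self.attn(feats, h)        # attention at the current moment
        fused = self.afs(ctx_prev, ctx)  # fuse two adjacent attention results
        h, c = self.lstm(torch.cat([self.embed(word), fused], dim=-1), (h, c))
        logits = self.out(self.hss(h, fused))  # HSS guides word generation
        return logits, h, c, ctx         # ctx becomes ctx_prev at the next step
```

At each step the previous context `ctx_prev` is carried forward, so the fused attention result links adjacent LSTM segments, which is one plausible reading of how the AFS strengthens the correlation between attention results and hidden states.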