Abstract: Image captioning is an active research topic in image understanding. To address the poor quality of generated sentences, we propose a Chinese image captioning method that combines dual attention with multi-label images. We first extract visual features and multi-label text from the image, and then use the multi-label text to strengthen the correlation between the decoder's hidden state and the visual features. Next, we redistribute the attention weights over the visual features according to the decoder's hidden state and decode the weighted features into words. Finally, the words are output one time step at a time to form a Chinese sentence. Experiments on two Chinese image captioning datasets, Flickr8k-CN and COCO-CN, show that the proposed method substantially improves sentence quality.
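
The two attention stages described above can be illustrated with a minimal NumPy sketch of a single decoding step. This is only one possible reading of the abstract, not the authors' implementation: the scaled dot-product scoring, the additive fusion of the label context into the hidden state, and all names (`attend`, `dual_attention_step`, `label_emb`, `visual_feats`) and dimensions are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys):
    # Scaled dot-product attention (assumed scoring function):
    # weight each row of `keys` by its similarity to `query`.
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)
    context = weights @ keys  # weighted sum of the rows of `keys`
    return weights, context

def dual_attention_step(hidden, label_emb, visual_feats):
    # Stage 1 (assumed): attend over the multi-label text embeddings and
    # add the resulting label context to the decoder hidden state.
    label_w, label_ctx = attend(hidden, label_emb)
    enhanced = hidden + label_ctx
    # Stage 2 (assumed): redistribute attention weights over the visual
    # regions using the label-enhanced hidden state.
    visual_w, visual_ctx = attend(enhanced, visual_feats)
    return label_w, visual_w, visual_ctx

# Hypothetical sizes: 5 labels, 49 image regions, 8-dimensional features.
rng = np.random.default_rng(0)
d = 8
hidden = rng.standard_normal(d)
label_emb = rng.standard_normal((5, d))
visual_feats = rng.standard_normal((49, d))

label_w, visual_w, visual_ctx = dual_attention_step(hidden, label_emb, visual_feats)
```

In a full decoder, `visual_ctx` would be passed to the recurrent unit and an output layer to predict the next word, and the step would be repeated until an end-of-sentence token is produced.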