Abstract: Automatic image captioning is an active research topic at the intersection of natural language processing and computer vision. Its task is to understand the semantic content of an image and express it in natural language. Because the overall quality of Chinese image captioning remains relatively low, this study uses FastText to generate word vectors and a convolutional neural network to extract global image features, then encodes sentence-image pairs 〈S, I〉 and merges them into a feature matrix containing both the Chinese description and the image information. The decoder uses an LSTM model to decode this feature matrix and obtains the decoding result by computing cosine similarity. Comparative experiments show that the proposed model outperforms other models on the BiLingual Evaluation Understudy (BLEU) metric, and the Chinese descriptions it generates accurately summarize the semantic information of the images.
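The pipeline sketched in the abstract (embed the sentence, extract an image feature, fuse the pair 〈S, I〉 into one matrix, then score decoder outputs by cosine similarity) can be illustrated with a minimal NumPy sketch. All dimensions, the concatenation-based fusion, and the random stand-ins for FastText, the CNN, and the LSTM hidden state are assumptions for illustration only; the abstract does not specify them.

```python
import numpy as np

# Illustrative dimensions (assumptions; the paper's actual sizes are not
# stated in the abstract).
EMB_DIM = 100   # FastText word-vector size
IMG_DIM = 100   # global CNN image feature size (after projection)

rng = np.random.default_rng(0)

def sentence_matrix(n_words: int) -> np.ndarray:
    """Stand-in for FastText embeddings of an n-word Chinese sentence S."""
    return rng.standard_normal((n_words, EMB_DIM))

def image_feature() -> np.ndarray:
    """Stand-in for a CNN global feature of image I."""
    return rng.standard_normal(IMG_DIM)

def fuse(S: np.ndarray, I: np.ndarray) -> np.ndarray:
    """Merge the pair <S, I> into one feature matrix by appending the
    image feature as an extra row (one possible fusion scheme)."""
    return np.vstack([S, I[None, :]])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity used to pick the decoding result."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Decoding step: choose the candidate word vector most similar to the
# decoder state (a random stand-in here for the LSTM hidden output).
hidden = rng.standard_normal(EMB_DIM)
vocab = rng.standard_normal((5, EMB_DIM))  # 5 candidate word vectors
best = int(np.argmax([cosine(hidden, v) for v in vocab]))
```

In a real system the LSTM hidden state would replace the random `hidden` vector and the vocabulary matrix would hold the trained FastText embeddings; the cosine-similarity argmax shown here is one straightforward reading of "obtains the decoding result by calculating cosine similarity."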