Abstract: Image caption generation models describe the content of images and the relationships among their attributes in natural language. Existing models suffer from low description quality, insufficient feature extraction from important image regions, and high complexity. This study therefore proposes an image caption generation model based on the Convolutional Block Attention Module (CBAM), built on an encoder-decoder structure. CBAM is added to the feature extraction network Inception-v4, which serves as the encoder and extracts the important feature information of the images. This information is then fed into the Long Short-Term Memory (LSTM) network of the decoder to generate captions for the corresponding images. The MSCOCO2014 dataset is used for training and testing, and multiple evaluation criteria are applied to assess the accuracy of the model. The experimental results show that the improved model scores higher on these criteria than other models, and that Model2 better extracts image features and generates more accurate descriptions.