Image Captioning Based on Dual Refined Attention
Author: CONG Lu-Wen

    Abstract:

    Image captioning is an important task that connects computer vision and natural language processing, two major fields of artificial intelligence. In recent years, encoder-decoder frameworks integrated with attention mechanisms have made significant progress in captioning. However, many attention-based methods use only a spatial attention mechanism. In this study, we propose a novel dual refined attention model for image captioning. The proposed model uses not only spatial attention but also channel-wise attention, and then applies a refinement module to further refine the attended image features, filtering out redundant and irrelevant features. We validate the proposed model on the MS COCO dataset with various evaluation metrics, and the results demonstrate its effectiveness compared with conventional methods.
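The abstract describes an architecture combining spatial attention (weighting image regions), channel-wise attention (weighting feature channels), and a refinement module that gates out redundant features. The following is only a minimal NumPy sketch of that general idea, not the paper's actual formulation: the shapes, the sigmoid gates, and the projection matrices `W_s`, `W_c`, `W_g` are all illustrative assumptions.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dual_refined_attention(features, hidden, W_s, W_c, W_g):
    """Sketch of dual (spatial + channel-wise) attention with a refine gate.

    features: (k, d) array - k spatial regions, each a d-channel CNN feature
    hidden:   (d,) array   - decoder hidden state (e.g. from an LSTM)
    W_s, W_c, W_g: (d, d)  - hypothetical projection matrices (assumptions)
    """
    # Spatial attention: one softmax weight per image region.
    alpha = softmax(features @ (W_s @ hidden))          # (k,)

    # Channel-wise attention: one sigmoid weight per feature channel,
    # conditioned on the mean region feature and the hidden state.
    beta = sigmoid(features.mean(axis=0) * (W_c @ hidden))  # (d,)

    # Attend: mix regions by alpha, then rescale channels by beta.
    attended = (alpha[:, None] * features).sum(axis=0) * beta  # (d,)

    # Refine: a sigmoid gate suppresses redundant/irrelevant components
    # of the attended feature before it is fed to the decoder.
    gate = sigmoid(W_g @ hidden)
    return gate * attended
```

In this sketch the spatial branch decides *where* to look and the channel branch decides *which* semantic channels matter, while the final gate acts as the "refinement" filter the abstract mentions; in a real model the projections would be learned jointly with the decoder.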

Cite this article

Cong LW. Image captioning based on dual refined attention. Computer Systems & Applications, 2020, 29(5): 245-251.
History
  • Received: 2019-10-07
  • Revised: 2019-11-07
  • Published online: 2020-05-07
  • Published: 2020-05-15