Image Captioning Based on Visit Control Module and Original Information Injection
Funding: Natural Science Foundation of Shandong Province (ZR2020MF136); Independent Innovation Research Program of the Central Universities (20CX05018A)


Image Captioning Based on Visit Control Module and Original Information Injection
    Abstract:

    In recent years, generating image captions from scene graphs has drawn growing research interest. However, current scene-graph-based image captioning models do not consider how well the long short-term memory (LSTM) network retains details of its earlier inputs, which can cause fine-grained information to be lost. To address this problem, this paper proposes an image captioning network based on original information injection, which modifies the input to the language LSTM of the baseline model so as to preserve as much of the original input information as possible and reduce the loss of that information during computation. In addition, this paper argues that the current scene-graph updating mechanism updates nodes too aggressively, and therefore designs a visit control module that updates the weights of visited nodes to avoid losing node information. A graph update factor (GUF) is also designed to guide graph updating and determine the degree of each update. Experiments on the official MSCOCO dataset under various evaluation metrics show that, compared with the baseline, the model based on the visit control module and original information injection achieves more competitive results and a clear advantage.

    Abstract:

    In recent years, the application of scene graphs to image captioning has been increasingly researched. However, current scene-graph-based image captioning models do not consider how much of the earlier input the long short-term memory (LSTM) network retains, which may lead to the loss of detail information. In this study, we first propose an image captioning network based on original information injection, which keeps as much of the original input information as possible and reduces information loss during computation. Second, we argue that the current graph updating mechanism updates nodes too strongly, which may discard node information; we therefore propose a visit control module that updates the weights of visited nodes to avoid such loss. Finally, we design a graph update factor (GUF) to determine the update level. We conduct experiments on the official MSCOCO dataset, and evaluation under multiple metrics shows that our model achieves more competitive results than the baseline model.
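The first mechanism above, original information injection, amounts to feeding the language LSTM the raw input features alongside the attended features, so that detail diluted by attention still reaches the decoder directly. The paper's exact formulation is not given on this page, so the following is a minimal illustrative sketch; all names (`attended`, `attn_hidden`, `original`) are hypothetical, not the paper's notation.

```python
import numpy as np

def language_lstm_input(attended, attn_hidden, original, inject=True):
    """Assemble the input vector for the language LSTM.

    In the baseline, the language LSTM sees only the attended feature
    vector and the attention LSTM's hidden state. With injection
    enabled, the original (unattended) features are concatenated as
    well, so details lost during attention can still reach the
    language LSTM. Illustrative sketch only.
    """
    parts = [attended, attn_hidden]
    if inject:
        parts.append(original)  # re-inject the raw input features
    return np.concatenate(parts)

# Dimensions grow by the size of the injected original features:
x = language_lstm_input(np.ones(4), np.ones(3), np.ones(5))  # shape (12,)
```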
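The visit control module and the graph update factor can be pictured as a softened re-weighting of visited scene-graph nodes: instead of effectively erasing a node once it has been described, the GUF scales how strongly its weight is reduced. The abstract does not give the exact update rule, so this is a hedged sketch under that interpretation; the function and parameter names are hypothetical.

```python
import numpy as np

def update_visited_weights(weights, visited_idx, guf):
    """Down-weight visited scene-graph nodes by a graph update factor.

    guf in [0, 1] controls the update strength: guf = 1 erases a
    visited node entirely (the over-aggressive update the paper argues
    against), while smaller values retain part of the node's
    information for later decoding steps. Illustrative sketch only.
    """
    w = np.asarray(weights, dtype=float).copy()
    w[visited_idx] *= 1.0 - guf  # partial, GUF-controlled update
    return w

# With guf = 0.5, a visited node keeps half its weight instead of
# being zeroed out:
w = update_visited_weights([1.0, 0.8, 0.6], [0], guf=0.5)  # [0.5, 0.8, 0.6]
```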

Cite this article

Li Y, Lu J, Hao YQ, Wei XY, Wu CL. Image captioning based on visit control module and original information injection. Computer Systems & Applications, 2022, 31(7): 106-112

History
  • Received: 2021-10-21
  • Revised: 2021-11-18
  • Published online: 2022-03-09