Image Captioning Based on Visit Control Module and Original Information Injection
Funding: Natural Science Foundation of Shandong Province (ZR2020MF136); Independent Innovation Research Program of the Central Universities (20CX05018A)


Image Captioning Based on Visit Control Module and Original Information Injection
    Abstract:

    In recent years, generating image captions from scene graphs has drawn growing research interest. However, current scene-graph-based image captioning models do not consider how well the long short-term memory (LSTM) network retains details of its earlier inputs, which can cause fine-grained information to be lost. To address this problem, this paper proposes an image captioning network based on original information injection, which modifies the input to the language LSTM of the baseline model so as to preserve as much of the original input information as possible and reduce the loss of that information during computation. In addition, this paper argues that the current scene-graph updating mechanism updates nodes too aggressively, and therefore designs a visit control module that updates the weights of visited nodes to avoid losing node information. A graph update factor (GUF) is also designed to guide graph updating and determine the degree of each update. Experiments on the official MSCOCO dataset under various evaluation metrics show that, compared with the baseline, the model based on the visit control module and original information injection achieves more competitive results and a clear advantage.

    Abstract:

    In recent years, the application of scene graphs to image captioning has been increasingly researched. However, current scene-graph-based image captioning models do not consider how much of the earlier input the long short-term memory (LSTM) network retains, which may lead to the loss of detail information. In this study, we first propose an image captioning network based on original information injection, which keeps as much of the original input information as possible and reduces information loss during computation. Second, we argue that the current graph updating mechanism updates nodes too strongly, which may discard node information; we therefore propose a visit control module that updates the weights of visited nodes to avoid such loss. Finally, we design a graph update factor (GUF) to determine the update level. We conduct experiments on the official MSCOCO dataset, and evaluation under multiple metrics shows that our model achieves more competitive results than the baseline model.
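The first mechanism above, original information injection, amounts to feeding the language LSTM the raw input features alongside the attended features, so that detail diluted by attention still reaches the decoder directly. The paper's exact formulation is not given on this page, so the following is a minimal illustrative sketch; all names (`attended`, `attn_hidden`, `original`) are hypothetical, not the paper's notation.

```python
import numpy as np

def language_lstm_input(attended, attn_hidden, original, inject=True):
    """Assemble the input vector for the language LSTM.

    In the baseline, the language LSTM sees only the attended feature
    vector and the attention LSTM's hidden state. With injection
    enabled, the original (unattended) features are concatenated as
    well, so details lost during attention can still reach the
    language LSTM. Illustrative sketch only.
    """
    parts = [attended, attn_hidden]
    if inject:
        parts.append(original)  # re-inject the raw input features
    return np.concatenate(parts)

# Dimensions grow by the size of the injected original features:
x = language_lstm_input(np.ones(4), np.ones(3), np.ones(5))  # shape (12,)
```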
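The visit control module and the graph update factor can be pictured as a softened re-weighting of visited scene-graph nodes: instead of effectively erasing a node once it has been described, the GUF scales how strongly its weight is reduced. The abstract does not give the exact update rule, so this is a hedged sketch under that interpretation; the function and parameter names are hypothetical.

```python
import numpy as np

def update_visited_weights(weights, visited_idx, guf):
    """Down-weight visited scene-graph nodes by a graph update factor.

    guf in [0, 1] controls the update strength: guf = 1 erases a
    visited node entirely (the over-aggressive update the paper argues
    against), while smaller values retain part of the node's
    information for later decoding steps. Illustrative sketch only.
    """
    w = np.asarray(weights, dtype=float).copy()
    w[visited_idx] *= 1.0 - guf  # partial, GUF-controlled update
    return w

# With guf = 0.5, a visited node keeps half its weight instead of
# being zeroed out:
w = update_visited_weights([1.0, 0.8, 0.6], [0], guf=0.5)  # [0.5, 0.8, 0.6]
```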

Cite this article

Li Y, Lu J, Hao YQ, Wei XY, Wu CL. Image captioning based on visit control module and original information injection. Computer Systems & Applications, 2022, 31(7): 106-112

History
  • Received: 2021-10-21
  • Revised: 2021-11-18
  • Published online: 2022-03-09