Image Captioning Model Based on a Dual Refined Attention Mechanism
计算机系统应用 (Computer Systems & Applications), 2020, Vol. 29, Issue 5: 245–251

Image Captioning Based on Dual Refined Attention
CONG Lu-Wen
College of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
Abstract: Image captioning is an important task that connects computer vision and natural language processing, two major fields of artificial intelligence. In recent years, encoder-decoder frameworks equipped with attention mechanisms have made significant progress in captioning. However, many attention-based methods use only a spatial attention mechanism. In this study, we propose a novel dual refined attention model for image captioning. The proposed model uses not only spatial attention but also channel-wise attention, and then applies a refinement module that filters redundant and irrelevant information out of the attended image features. We validate the proposed model on the MSCOCO dataset with various evaluation metrics, and the results show its effectiveness.
Key words: image captioning; spatial attention; channel-wise attention; Long Short-Term Memory (LSTM); computer vision

1 Introduction

2 Related Work

3 Model

3.1 Encoder

$A = \{a_1, \cdots, a_L\},\;\; a_i \in \mathbb{R}^D$ (1)

$a^g = \dfrac{1}{L}\sum\limits_{i=1}^{L} a_i$ (2)

$q_i = \mathrm{ReLU}(W_a a_i)$ (3)
$q^g = \mathrm{ReLU}(W_b a^g)$ (4)
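As an illustrative sketch only, Eqs. (1)–(4) can be traced numerically as follows (NumPy; the dimensions $L$, $D$, $d$ and the random weights are assumptions for the example, not values from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

L, D, d = 49, 2048, 512            # assumed: 7x7 CNN grid, CNN feature dim, projected dim
rng = np.random.default_rng(0)

A = rng.standard_normal((L, D))    # Eq. (1): local region features a_1, ..., a_L
a_g = A.mean(axis=0)               # Eq. (2): global feature, average over regions

W_a = rng.standard_normal((d, D)) * 0.01
W_b = rng.standard_normal((d, D)) * 0.01

Q = relu(A @ W_a.T)                # Eq. (3): projected local features q_i, shape (L, d)
q_g = relu(W_b @ a_g)              # Eq. (4): projected global feature, shape (d,)
```

Downstream, the rows of $Q$ feed the two attention branches, while $q^g$ is concatenated with the word embedding as decoder input in Eq. (18).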

Fig. 1 Overall framework

Fig. 2 Structure of the decoder

3.2 Spatial Attention Model

$\hat{h}_t = \mathrm{Conv}(h_t)$ (5)
$z_t^s = w_{hs}^{\rm T}\tanh(W_{qs}Q + (W_{ss}\hat{h}_t)\mathbf{1}^{\rm T})$ (6)
$\alpha_t = \mathrm{Softmax}(z_t^s)$ (7)

$V_t = \sum\limits_{i=1}^{L}\alpha_{ti}\, q_i$ (8)
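Under the same illustrative assumptions (NumPy, random weights, $L = 49$, $d = 512$; the convolution of Eq. (5) is omitted and $\hat{h}_t$ is taken as given), Eqs. (6)–(8) amount to:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

L, d = 49, 512
rng = np.random.default_rng(1)
Q = rng.standard_normal((d, L))    # projected local features, one column per region
h_hat = rng.standard_normal(d)     # convolved decoder hidden state \hat{h}_t

W_qs = rng.standard_normal((d, d)) * 0.01
W_ss = rng.standard_normal((d, d)) * 0.01
w_hs = rng.standard_normal(d)

# Eq. (6): the outer product with 1^T broadcasts W_ss @ h_hat to every region
z_s = w_hs @ np.tanh(W_qs @ Q + np.outer(W_ss @ h_hat, np.ones(L)))
alpha = softmax(z_s)               # Eq. (7): spatial attention weights over L regions
V = Q @ alpha                      # Eq. (8): attended spatial feature V_t, shape (d,)
```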

3.3 Channel-Wise Attention Model

Zhou et al. [13] found that each hidden unit can align with a different semantic concept. In spatial-attention-based models, however, all channels are weighted identically, which ignores these semantic differences. As shown in Fig. 2, this paper therefore also adopts a channel-wise attention mechanism: the local region features $Q \in \mathbb{R}^{d \times L}$ and the convolved hidden state $\hat{h}_t$ of the decoder at the current time step are fed into a single-layer perceptron, and a Softmax function then computes the attention distribution over the channels of the local image features:

$z_t^c = w_{hc}^{\rm T}(W_{qc}Q^{\rm T} + (W_{sc}\hat{h}_t)\mathbf{1}^{\rm T})$ (9)
$\beta_t = \mathrm{Softmax}(z_t^c)$ (10)

$U_t = \sum\limits_{i=1}^{d}\beta_{ti}\, Q_i^{\rm T}$ (11)
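A matching sketch of Eqs. (9)–(11) under the same illustrative assumptions ($k$ is an assumed hidden width of the single-layer perceptron, not a value from the paper). Note that Eq. (9) has no $\tanh$, and that Eq. (11) sums rows of $Q$ so that $U_t$ lives in $\mathbb{R}^L$, to be projected later by the refinement module:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

L, d, k = 49, 512, 256
rng = np.random.default_rng(2)
Q = rng.standard_normal((d, L))    # projected local features
h_hat = rng.standard_normal(d)     # convolved decoder hidden state \hat{h}_t

W_qc = rng.standard_normal((k, L)) * 0.01
W_sc = rng.standard_normal((k, d)) * 0.01
w_hc = rng.standard_normal(k)

# Eq. (9): score each of the d channels (columns of Q^T); no tanh in the paper's formula
z_c = w_hc @ (W_qc @ Q.T + np.outer(W_sc @ h_hat, np.ones(d)))
beta = softmax(z_c)                # Eq. (10): channel-wise attention weights
U = Q.T @ beta                     # Eq. (11): weighted sum of channel rows Q_i^T, shape (L,)
```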

3.4 Feature Refinement Module

$V_t' = W_{vd} V_t$ (12)
$U_t' = W_{ud} U_t$ (13)
$h_n^v = f_{\rm LSTM}(V_t',\; h_{n-1}^v)$ (14)
$h_n^u = f_{\rm LSTM}(U_t',\; h_{n-1}^u)$ (15)
$\hat{V}_t = h_n^v$ (16)
$\hat{U}_t = h_n^u$ (17)

3.5 Decoder

LSTM is widely used in existing image captioning models because of its strong ability to model long-term dependencies. This paper follows the commonly used LSTM structure; the gate units and memory cell of the basic LSTM block are defined as follows:

$\left\{ \begin{aligned} x_t &= [W_e y_{t-1};\; q^g], \quad t \ge 1 \\ f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) \\ i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) \\ o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned} \right.$ (18)
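A minimal NumPy sketch of one step of Eq. (18). For compactness, the four gates' weights are packed into a single matrix acting on $[x_t; h_{t-1}]$, which is algebraically equivalent to the separate $W_{*x}$, $W_{*h}$ matrices; the sizes and initialization are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of Eq. (18): gates f, i, o and candidate g from [x_t; h_prev]."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    c_t = f * c_prev + i * np.tanh(g)      # memory cell update
    h_t = o * np.tanh(c_t)                 # hidden state
    return h_t, c_t

d_x, d_h = 1024, 512                       # assumed: size of [W_e y_{t-1}; q^g], hidden size
rng = np.random.default_rng(3)
W = rng.standard_normal((4 * d_h, d_x + d_h)) * 0.01
b = np.zeros(4 * d_h)

x_t = rng.standard_normal(d_x)             # stands in for x_t = [W_e y_{t-1}; q^g]
h_t, c_t = lstm_step(x_t, np.zeros(d_h), np.zeros(d_h), W, b)
```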

$p(y_t \mid y_1, \cdots, y_{t-1},\, I) = \mathrm{Softmax}(W_p(h_t + \hat{U}_t + \hat{V}_t))$ (19)

$L_{XE}(\theta) = -\sum\limits_{t=1}^{T} \log p_\theta(y_t^* \mid y_1^*, \cdots, y_{t-1}^*)$ (20)
$L_R = -\mathbb{E}_{y_{1:T}\sim p_\theta}\left[ r(y_{1:T}) \right]$ (21)
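The cross-entropy objective of Eq. (20) simply sums the negative log-probabilities of the ground-truth words; a small sketch follows (the vocabulary size and caption length are illustrative assumptions). The expected-reward objective of Eq. (21) is in practice estimated by sampling captions and weighting their log-probabilities by a baselined reward, and is not shown here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def xe_loss(logits, targets):
    """Eq. (20): negative log-likelihood of the ground-truth words y_1*, ..., y_T*."""
    return -sum(np.log(softmax(z)[y]) for z, y in zip(logits, targets))

T, V = 5, 100                          # assumed: caption length, vocabulary size
rng = np.random.default_rng(4)
logits = rng.standard_normal((T, V))   # per-step pre-softmax outputs of Eq. (19)
targets = rng.integers(0, V, size=T)   # ground-truth word indices
loss = xe_loss(logits, targets)        # scalar; lower means the captions fit better
```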

4 Experiments

4.1 Datasets and Evaluation Metrics

4.2 Implementation Details

4.3 Compared Methods

Google NIC [2] uses an encoder-decoder framework, with a convolutional neural network as the encoder and an LSTM as the decoder.

Hard-Attention [9] introduced the spatial attention mechanism into image captioning, dynamically assigning weights to the features of different image regions according to the decoder's state.

MSM [6] jointly exploits image attribute information and global image features.

Att2all [8] was the first to propose and use the SCST training method.

SCA-CNN [12] uses both spatial and channel-wise attention.

4.4 Experimental Analysis

5 Conclusion and Future Work

References

[1] Ren SQ, He KM, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems. 2015. 91–99.
[2] Vinyals O, Toshev A, Bengio S, et al. Show and tell: A neural image caption generator. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA. 2015. 3156–3164.
[3] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[4] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735–1780. DOI:10.1162/neco.1997.9.8.1735
[5] Wu Q, Shen CH, Liu LQ, et al. What value do explicit high level concepts have in vision to language problems? Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 2016. 203–212.
[6] Yao T, Pan YW, Li YH, et al. Boosting image captioning with attributes. Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice, Italy. 2017. 4894–4902.
[7] Ranzato MA, Chopra S, Auli M, et al. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
[8] Rennie SJ, Marcheret E, Mroueh Y, et al. Self-critical sequence training for image captioning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. 2017. 7008–7024.
[9] Xu K, Ba J, Kiros R, et al. Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the 32nd International Conference on Machine Learning. Lille, France. 2015. 2048–2057.
[10] You QZ, Jin HL, Wang ZW, et al. Image captioning with semantic attention. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 2016. 4651–4659.
[11] Lu JS, Xiong CM, Parikh D, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. 2017. 375–383.
[12] Chen L, Zhang HW, Xiao J, et al. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. 2017. 5659–5667.
[13] Zhou BL, Bau D, Oliva A, et al. Interpreting deep visual representations via network dissection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(9): 2131–2145. DOI:10.1109/TPAMI.2018.2858759
[14] Lin TY, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context. Proceedings of the European Conference on Computer Vision (ECCV 2014). Zurich, Switzerland. 2014. 740–755.
[15] Karpathy A, Li FF. Deep visual-semantic alignments for generating image descriptions. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA. 2015. 3128–3137.
[16] Papineni K, Roukos S, Ward T, et al. BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, PA, USA. 2002. 311–318.
[17] Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, MI, USA. 2005. 65–72.
[18] Lin CY. ROUGE: A package for automatic evaluation of summaries. Proceedings of the Workshop on Text Summarization Branches Out. Barcelona, Spain. 2004. 74–81.
[19] Vedantam R, Zitnick CL, Parikh D. CIDEr: Consensus-based image description evaluation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA. 2015. 4566–4575.
[20] Chen XL, Fang H, Lin TY, et al. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[21] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2015.
[22] Zhou ZP, Zhang W. Image caption generation model combining visual attribute attention and residual connections. Journal of Computer-Aided Design & Computer Graphics, 2018, 30(8): 1536–1542, 1553. (in Chinese)