Image Caption Algorithm Based on ViLBERT and BiLSTM

Funding: National Natural Science Foundation of China (61872230, 61802248, 61802249); Shanghai Universities Young Teacher Training Funding Program (ZZsdl18006)

    Abstract:

    Traditional image captioning suffers from under-utilized image features, a lack of context-information learning, and an excessive number of training parameters. This study proposes an image captioning algorithm based on Vision-and-Language BERT (ViLBERT) and a Bidirectional Long Short-Term Memory network (BiLSTM). The ViLBERT model serves as the encoder: it fuses image features and descriptive text through a co-attention mechanism and outputs a joint image-text feature vector. The decoder is a BiLSTM combined with an attention mechanism that generates the image caption. The algorithm is trained and tested on the MSCOCO 2014 dataset, reaching BLEU-4 and CIDEr scores of 36.9 and 125.2 respectively, outperforming image captioning algorithms that combine traditional image feature extraction with an attention mechanism. A comparison of the generated captions shows that this algorithm describes image content in finer detail.
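The abstract does not detail the architecture, but the co-attention fusion it attributes to ViLBERT can be illustrated with a minimal scaled dot-product cross-attention sketch, where tokens of one modality attend over features of the other. All names, dimensions, and values below are illustrative assumptions, not the authors' implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """One co-attention direction: each query (e.g. a text token)
    attends over the other modality's features (e.g. image regions)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Scaled dot-product scores against every key, then softmax.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))])
    return out

# Toy example: 2 text tokens attending over 3 image-region features (dim 4).
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
regions = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
fused = cross_attention(text, regions, regions)
```

In ViLBERT this exchange runs in both directions (text attending to image regions and vice versa) inside each co-attention transformer block, which is what produces the joint feature vector mentioned above.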

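BLEU-4, the metric reported above, scores the n-gram overlap between a generated caption and reference captions. A simplified single-reference version (illustrative only; the official MSCOCO evaluation uses multiple references per image and smoothing) is:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        c_counts = Counter(ngrams(cand, n))
        r_counts = Counter(ngrams(ref, n))
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_prec_sum += math.log(clipped / total)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)

print(round(bleu("a dog runs in the park", "a dog runs in the park"), 3))  # identical → 1.0
```

Reported BLEU-4 values such as the 36.9 above are this quantity scaled by 100 and averaged over the test set with multiple references.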
Cite this article:

许昊, 张凯, 田英杰, 种法广, 王子超. Image Caption Algorithm Based on ViLBERT and BiLSTM. 计算机系统应用 (Computer Systems & Applications), 2021, 30(11): 195–202
History:
  • Received: 2020-12-29
  • Revised: 2021-02-03
  • Published online: 2021-10-22