Mitigating Object Hallucinations in Large Visual Language Models Through Image Contrast Enhancement
Authors: 卜立平, 常贵勇, 于碧辉, 刘大伟, 魏靖烜, 孙林壮, 刘龙翼
Funding: Shenyang Science and Technology Plan 2023 (23407329)

    Abstract:

    Large visual language models (LVLMs) demonstrate remarkable capabilities in understanding visual information and generating verbal expressions. However, LVLMs are often affected by object hallucinations: the generated text appears plausible but does not align with the visual information in the image, creating a mismatch between text and image. Through experiments, this study identifies the lack of object attention as a key factor contributing to object hallucinations. To mitigate this problem, an image contrast enhancement (ICE) method is introduced. ICE is a training-free, easy-to-apply approach that contrasts the output distributions produced from the original and the contrast-enhanced visual inputs, improving the model's perception of the image and ensuring that the generated content aligns closely with the visual input, thereby producing contextually consistent and accurate outputs. Experimental results show that ICE significantly mitigates object hallucinations across different LVLMs without additional training or external tools, and it also performs well on the MME benchmark for large visual language models, demonstrating its broad applicability and effectiveness. The code is available at ChangGuiyong/ICE.
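
    The abstract describes ICE as contrasting the output distributions obtained from the original image and a contrast-enhanced copy of it. The sketch below illustrates one plausible reading of that idea, assuming a VCD-style contrastive step at a single decoding position; the contrast transform, the weight alpha, and the direction of the contrast are illustrative assumptions, not the paper's published formulation.

    import numpy as np

    def enhance_contrast(image, factor=1.5):
        # Global contrast stretch around the image mean; values stay in [0, 1].
        # (Illustrative stand-in for the paper's image enhancement step.)
        mean = image.mean()
        return np.clip(mean + factor * (image - mean), 0.0, 1.0)

    def contrastive_next_token(logits_enhanced, logits_original, alpha=1.0):
        # Amplify evidence from the enhanced view relative to the original view,
        # then renormalize the result into a next-token distribution.
        contrasted = (1.0 + alpha) * logits_enhanced - alpha * logits_original
        contrasted = contrasted - contrasted.max()  # numerical stability for exp()
        probs = np.exp(contrasted)
        return probs / probs.sum()

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        image = rng.random((224, 224, 3))   # dummy RGB image in [0, 1]
        enhanced = enhance_contrast(image)

        # Dummy logits standing in for two LVLM forward passes at one decoding
        # step: one conditioned on the original image, one on the enhanced image.
        vocab_size = 32000
        logits_original = rng.normal(size=vocab_size)
        logits_enhanced = logits_original + rng.normal(scale=0.1, size=vocab_size)

        probs = contrastive_next_token(logits_enhanced, logits_original, alpha=1.0)
        print("next-token argmax id:", int(probs.argmax()))

    In an actual decoder, both forward passes would share the same text prefix at every generation step, and the contrasted distribution would replace the standard softmax output before sampling, which is what makes the approach training-free.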

Cite this article

卜立平, 常贵勇, 于碧辉, 刘大伟, 魏靖烜, 孙林壮, 刘龙翼. Mitigating Object Hallucinations in Large Visual Language Models Through Image Contrast Enhancement. 计算机系统应用 (Computer Systems & Applications): 1–9.

History
  • Received: 2024-10-16
  • Revised: 2024-11-29
  • Published online: 2025-03-31