Mitigating Object Hallucinations in Large Visual Language Models Through Image Contrast Enhancement
Authors: 卜立平, 常贵勇, 于碧辉, 刘大伟, 魏靖烜, 孙林壮, 刘龙翼
Funding: Shenyang Science and Technology Plan 2023 (23407329)

    Abstract:

    Large visual language models (LVLMs) demonstrate remarkable capabilities in understanding visual information and generating verbal expressions. However, LVLMs are often affected by object hallucinations: the generated text appears plausible but does not align with the visual information in the image, creating a mismatch between text and image. Through experiments, this study identifies the lack of object attention as a key factor contributing to object hallucinations. To mitigate this problem, an image contrast enhancement (ICE) method is introduced. ICE is a training-free, easy-to-apply approach that contrasts the output distributions produced from the original and the contrast-enhanced visual inputs, improving the model's perception of the image and ensuring that the generated content aligns closely with the visual input, thereby producing contextually consistent and accurate outputs. Experimental results show that ICE significantly mitigates object hallucinations across different LVLMs without additional training or external tools, and it also performs well on the MME benchmark for large visual language models, demonstrating its broad applicability and effectiveness. The code is available at ChangGuiyong/ICE.
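
    The abstract describes ICE as contrasting the output distributions obtained from the original image and a contrast-enhanced copy of it. The sketch below illustrates one plausible reading of that idea, assuming a VCD-style contrastive step at a single decoding position; the contrast transform, the weight alpha, and the direction of the contrast are illustrative assumptions, not the paper's published formulation.

    import numpy as np

    def enhance_contrast(image, factor=1.5):
        # Global contrast stretch around the image mean; values stay in [0, 1].
        # (Illustrative stand-in for the paper's image enhancement step.)
        mean = image.mean()
        return np.clip(mean + factor * (image - mean), 0.0, 1.0)

    def contrastive_next_token(logits_enhanced, logits_original, alpha=1.0):
        # Amplify evidence from the enhanced view relative to the original view,
        # then renormalize the result into a next-token distribution.
        contrasted = (1.0 + alpha) * logits_enhanced - alpha * logits_original
        contrasted = contrasted - contrasted.max()  # numerical stability for exp()
        probs = np.exp(contrasted)
        return probs / probs.sum()

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        image = rng.random((224, 224, 3))   # dummy RGB image in [0, 1]
        enhanced = enhance_contrast(image)

        # Dummy logits standing in for two LVLM forward passes at one decoding
        # step: one conditioned on the original image, one on the enhanced image.
        vocab_size = 32000
        logits_original = rng.normal(size=vocab_size)
        logits_enhanced = logits_original + rng.normal(scale=0.1, size=vocab_size)

        probs = contrastive_next_token(logits_enhanced, logits_original, alpha=1.0)
        print("next-token argmax id:", int(probs.argmax()))

    In an actual decoder, both forward passes would share the same text prefix at every generation step, and the contrasted distribution would replace the standard softmax output before sampling, which is what makes the approach training-free.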

Cite this article

卜立平, 常贵勇, 于碧辉, 刘大伟, 魏靖烜, 孙林壮, 刘龙翼. Mitigating Object Hallucinations in Large Visual Language Models Through Image Contrast Enhancement. 计算机系统应用 (Computer Systems & Applications): 1–9.

History
  • Received: 2024-10-16
  • Revised: 2024-11-29
  • Published online: 2025-03-31