Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions
- URL: http://arxiv.org/abs/2501.10011v1
- Date: Fri, 17 Jan 2025 07:48:37 GMT
- Title: Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions
- Authors: Zhijie Tan, Yuzhi Li, Shengwei Meng, Xiang Yuan, Weiping Li, Tong Mo, Bingce Wang, Xu Chu
- Abstract summary: Current popular Large Vision-Language Models (LVLMs) suffer from Hallucinations on Object Attributes (HoOA). This paper proposes a novel method to mitigate HoOA in LVLMs.
- Score: 13.16300262271362
- Abstract: Current popular Large Vision-Language Models (LVLMs) suffer from Hallucinations on Object Attributes (HoOA), leading to incorrect determination of fine-grained attributes in input images. Leveraging significant advancements in 3D generation from a single image, this paper proposes a novel method to mitigate HoOA in LVLMs. The method uses multiview images sampled from generated 3D representations as visual prompts for LVLMs, thereby providing additional visual information from other viewpoints. Furthermore, we observe that the input order of the multiview images significantly affects LVLM performance. Consequently, we devise the Multiview Image Augmented VLM (MIAVLM), which incorporates a Multiview Attributes Perceiver (MAP) submodule capable of simultaneously eliminating the influence of input image order and aligning visual information from multiview images with Large Language Models (LLMs). In addition, we design and employ negative instructions to mitigate LVLMs' bias towards "Yes" responses. Comprehensive experiments demonstrate the effectiveness of our method.
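The order-invariance property attributed to the MAP submodule can be illustrated with a minimal sketch. The PyTorch snippet below is an illustrative assumption, not the paper's released implementation: a set of learned queries cross-attends over the pooled tokens of all views, and no view-order positional encoding is added, so permuting the input views leaves the output unchanged up to numerical precision. All names (MultiviewAttributesPerceiverSketch, num_queries, vis_dim, llm_dim) are hypothetical.

```python
# Minimal sketch of an order-invariant multiview resampler in the spirit of
# the Multiview Attributes Perceiver (MAP). Assumed shapes and names only.
import torch
import torch.nn as nn


class MultiviewAttributesPerceiverSketch(nn.Module):
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096,
                 num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        # Learned latent queries that jointly summarize all views.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)
        # Projection into the LLM embedding space, yielding soft visual prompts.
        self.to_llm = nn.Linear(vis_dim, llm_dim)

    def forward(self, view_tokens: torch.Tensor) -> torch.Tensor:
        """view_tokens: (batch, num_views, tokens_per_view, vis_dim) from a vision encoder."""
        b, v, t, d = view_tokens.shape
        # Flatten views into one unordered token set; with no per-view positional
        # encoding, cross-attention is invariant to the order of the views.
        kv = view_tokens.reshape(b, v * t, d)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)
        return self.to_llm(self.norm(fused))  # (batch, num_queries, llm_dim)


if __name__ == "__main__":
    map_module = MultiviewAttributesPerceiverSketch()
    views = torch.randn(2, 4, 257, 1024)      # 4 multiview images per sample
    shuffled = views[:, torch.randperm(4)]    # same views, different input order
    out_a, out_b = map_module(views), map_module(shuffled)
    print(torch.allclose(out_a, out_b, atol=1e-5))  # True: output ignores view order
```

In the same spirit, a negative instruction would pair a question about an attribute the object does not have with the expected answer "No" (for instance, asking whether a wooden chair is made of metal), counteracting the "Yes" bias; the paper's concrete instruction templates are not given in this summary.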
Related papers
- Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation [10.468784974994465]
The projector plays a crucial role in multimodal large language models (MLLMs).
Current explorations on the projector focus on reducing the number of visual tokens to improve efficiency.
A Spatial-Aware Efficient Projector (SAEP) is proposed to address this issue.
arXiv Detail & Related papers (2024-10-14T09:25:09Z)
- Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models [15.029014337718849]
Large vision-language models (LVLMs) integrate visual information into large language models, showcasing remarkable multi-modal conversational capabilities.
In general, LVLMs rely on vision encoders to transform images into visual tokens, which are crucial for the language models to perceive image contents effectively.
We propose a non-targeted attack method referred to as VT-Attack, which constructs adversarial examples from multiple perspectives.
arXiv Detail & Related papers (2024-10-09T09:06:56Z)
- AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008 \times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that combine LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context that involves multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLM training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)