Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
- URL: http://arxiv.org/abs/2502.13928v1
- Date: Wed, 19 Feb 2025 18:05:42 GMT
- Title: Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
- Authors: Shengguang Wu, Fan-Yun Sun, Kaiyue Wen, Nick Haber
- Abstract summary: Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors.
We propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details.
- Score: 7.823336661261962
- Abstract: Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in errors in visually grounded tasks and hallucinations. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate texts that are accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data to challenge the model with hard contrastive cases involving Minimal Visual Contrasts. Experiments show that our method consistently improves VLM performance across diverse benchmarks covering various abilities and domains, achieving up to a 22% reduction in hallucinations and significant gains in vision-centric and general tasks. Notably, these improvements become increasingly pronounced in benchmarks with higher visual dependency. In short, S-VCO offers a significant enhancement of VLMs' visually dependent task performance while retaining or even improving the model's general abilities. We open-source our code at https://s-vco.github.io/
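Since the abstract describes the objective only in words, here is a minimal, illustrative PyTorch sketch of what a symmetric visual-contrastive preference loss over a minimal-contrast pair could look like. The function name, the `beta` scale, and the scalar log-likelihood inputs are assumptions for illustration; the exact S-VCO formulation is given in the paper and the linked code.

```python
import torch
import torch.nn.functional as F

def svco_style_loss(logp_match_a, logp_mismatch_a,
                    logp_match_b, logp_mismatch_b, beta=1.0):
    """Illustrative symmetric visual-contrastive objective (not the
    paper's exact loss). For a minimal-contrast pair (image_a, text_a) /
    (image_b, text_b), the model should assign each text a higher
    likelihood under its matching image than under the counterfactual
    image, in both directions. Inputs are summed token log-likelihoods
    log p(text | image)."""
    # Direction A: text_a should be likelier given image_a than image_b.
    loss_a = -F.logsigmoid(beta * (logp_match_a - logp_mismatch_a))
    # Direction B: the symmetric term for text_b and image_b.
    loss_b = -F.logsigmoid(beta * (logp_match_b - logp_mismatch_b))
    return (loss_a + loss_b).mean()

# Toy usage with scalar log-likelihoods for a batch of one pair.
print(svco_style_loss(torch.tensor([-10.0]), torch.tensor([-14.0]),
                      torch.tensor([-11.0]), torch.tensor([-15.0])))
```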
Related papers
- Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models [127.38740043393527]
We propose ViFT, a visual instruction-free fine-tuning framework for LVLMs.
We require only text-only instructions and image caption data during training, learning the task-solving and visual perception abilities separately.
Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and visual instruction following benchmarks.
arXiv Detail & Related papers (2025-02-17T04:38:12Z) - Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models [29.611769371733672]
We propose Decomposed Attention (D-Attn), a novel method that processes visual and textual embeddings differently.
D-Attn diagonalizes visual-to-visual self-attention, reducing computation from $\mathcal{O}(|V|^2)$ to $\mathcal{O}(|V|)$ for $|V|$ visual embeddings without compromising performance.
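To make the complexity claim concrete, here is a small PyTorch sketch contrasting full visual-to-visual self-attention with a diagonalized variant. The shapes and the omission of the projection matrices are simplifying assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def full_visual_attention(q, k, v):
    # O(|V|^2): every visual token attends to every other visual token.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def diagonal_visual_attention(q, k, v):
    # O(|V|): each visual token attends only to itself; the softmax over
    # a single key is identically 1, so the output is just the value.
    return v

B, V, d = 2, 256, 64
q, k, v = (torch.randn(B, V, d) for _ in range(3))
print(full_visual_attention(q, k, v).shape)      # torch.Size([2, 256, 64])
print(diagonal_visual_attention(q, k, v).shape)  # same shape, linear cost
```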
arXiv Detail & Related papers (2025-02-04T00:46:11Z) - Supervision-free Vision-Language Alignment [11.012355590697064]
We introduce SVP (Supervision-free Visual Projection), a framework that enhances vision-language alignment without relying on curated data or preference annotation.
We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall.
arXiv Detail & Related papers (2025-01-08T15:32:12Z) - Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
Multimodal large language models (MLLMs) have promised an elegant bridge between vision and language.
We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features.
Our VTC integrates text-to-image generation as a guide for identifying the text-irrelevant features; a visual selector is then developed to generate complementary visual tokens.
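A hypothetical reading of the "visual selector" step, sketched in PyTorch: score visual tokens by text relevance and keep the least relevant ones as complements, so information the prompt ignores is still surfaced to the LLM. The dot-product scoring rule and the top-k choice are assumptions for illustration.

```python
import torch

def select_complementary_tokens(vis_tokens, text_emb, k=8):
    """Sketch of a VTC-style selector (scoring rule assumed): rank
    visual tokens by similarity to the text and keep the LEAST
    text-relevant ones as complementary tokens.
    vis_tokens: (N, d); text_emb: (d,)"""
    scores = vis_tokens @ text_emb          # text relevance per token
    idx = scores.argsort()[:k]              # least relevant tokens first
    return vis_tokens[idx]

print(select_complementary_tokens(torch.randn(64, 32), torch.randn(32)).shape)
```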
arXiv Detail & Related papers (2024-08-09T12:13:01Z) - How Well Can Vision Language Models See Image Details? [53.036922527685064]
We introduce a pixel value prediction task to explore "How Well Can Vision Language Models See Image Details?"
Our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks.
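One plausible instantiation of pixel value prediction as an auxiliary pre-training head, sketched in PyTorch; the feature and patch dimensions are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PixelPredictionHead(nn.Module):
    """Sketch of an auxiliary head that regresses raw RGB pixel values
    from vision-encoder patch features -- one plausible reading of
    'pixel value prediction' as a VLM pre-training task."""
    def __init__(self, feat_dim=768, patch=16):
        super().__init__()
        self.proj = nn.Linear(feat_dim, patch * patch * 3)  # per-patch RGB

    def forward(self, patch_features):        # (B, num_patches, feat_dim)
        return self.proj(patch_features)      # (B, num_patches, patch*patch*3)

head = PixelPredictionHead()
feats = torch.randn(4, 196, 768)              # e.g. a 14x14 patch grid
target = torch.rand(4, 196, 16 * 16 * 3)      # normalized pixels in [0, 1]
loss = nn.functional.mse_loss(head(feats), target)
print(loss.item())
```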
arXiv Detail & Related papers (2024-08-07T17:59:40Z) - BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision-language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM).
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
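A rough sketch of how such before/after metrics could be computed from paired model answers; the field names and the simplified proxies for SB and ID are assumptions here, and the exact definitions are in the BEAF paper.

```python
def beaf_style_metrics(records):
    """Sketch of before/after hallucination metrics in the spirit of
    BEAF's TU / IG / SB / ID (exact definitions are in the paper).
    Assumed fields per record: correct_before, correct_after (bools),
    answer_changed (bool), truth_changed (whether the scene edit
    actually flips the ground-truth answer for this question)."""
    n = len(records)
    tu = sum(r["correct_before"] and r["correct_after"] for r in records) / n
    ig = sum(not r["correct_before"] and not r["correct_after"] for r in records) / n
    sb = sum(r["truth_changed"] and not r["answer_changed"] for r in records) / n
    id_ = sum(not r["truth_changed"] and r["answer_changed"] for r in records) / n
    return {"TU": tu, "IG": ig, "SB": sb, "ID": id_}

print(beaf_style_metrics([
    dict(correct_before=True, correct_after=True, answer_changed=True, truth_changed=True),
    dict(correct_before=True, correct_after=False, answer_changed=False, truth_changed=True),
]))
```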
arXiv Detail & Related papers (2024-07-18T12:11:12Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We treat visual prompts as additional knowledge that helps language models address tasks involving visual information.
We introduce a novel approach wherein visual prompts are memorized within the weights of the feed-forward network (FFN) for visual knowledge injection.
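Viewing the FFN as key-value memory, a minimal sketch of injecting visual features as extra memory entries might look as follows; the shapes, the GELU scoring, and the additive read-out are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def ffn_with_visual_memory(x, w1, w2, vis_keys, vis_values):
    """Sketch of memory-space visual prompting: treat the FFN as
    key-value memory and append visual features as extra key/value
    entries, instead of prepending visual tokens to the input sequence.
    x: (B, T, d); w1: (d, h); w2: (h, d); vis_keys/vis_values: (B, m, d)."""
    out = F.gelu(x @ w1) @ w2                           # standard FFN path
    # Visual memory: score tokens against visual keys, read visual values.
    vis_hidden = F.gelu(x @ vis_keys.transpose(1, 2))   # (B, T, m)
    return out + vis_hidden @ vis_values

B, T, d, h, m = 2, 8, 64, 256, 4
out = ffn_with_visual_memory(torch.randn(B, T, d), torch.randn(d, h),
                             torch.randn(h, d), torch.randn(B, m, d),
                             torch.randn(B, m, d))
print(out.shape)  # torch.Size([2, 8, 64])
```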
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
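An illustrative two-step loop in the spirit of Chain-of-Spot: first ask the model for the region of interest, then answer using both the full image and the zoomed crop. The `vlm` callable and its ROI reply format are hypothetical stand-ins, not the paper's interface.

```python
from PIL import Image

def chain_of_spot(image, question, vlm):
    """Sketch of a Chain-of-Spot-style interactive loop. `vlm` is a
    hypothetical callable vlm(images, prompt) -> str; the ROI-parsing
    format below is an assumption for illustration."""
    roi_reply = vlm([image], f"To answer '{question}', which region matters? "
                             "Reply as: left,top,right,bottom (pixels)")
    box = tuple(int(v) for v in roi_reply.split(","))
    crop = image.crop(box)  # zoomed view; original resolution is untouched
    return vlm([image, crop], question)

# Toy usage with a stubbed-out model.
fake_vlm = lambda imgs, prompt: "0,0,8,8" if "region" in prompt else "a cat"
print(chain_of_spot(Image.new("RGB", (16, 16)), "What animal is shown?", fake_vlm))
```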
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that our approach not only reduces hallucinations but also improves performance on standard benchmarks, outperforming prior approaches.
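Preference tuning of this kind typically builds on a DPO-style objective; a minimal sketch is below, assuming summed log-likelihoods from the policy and a frozen reference model. This is generic DPO, not necessarily POVID's exact loss.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pref, logp_dispref, ref_logp_pref, ref_logp_dispref, beta=0.1):
    """Sketch of the DPO-style objective that POVID-like preference
    tuning builds on: push the policy to prefer the ground-truth
    (preferred) response over the generated dispreferred one,
    regularized by a frozen reference model."""
    policy_margin = logp_pref - logp_dispref
    ref_margin = ref_logp_pref - ref_logp_dispref
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

print(dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
               torch.tensor([-11.0]), torch.tensor([-11.5])))
```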
arXiv Detail & Related papers (2024-02-18T00:56:16Z) - Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
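The zero-shot protocol presumably scores an image against textual descriptions of candidate data types (e.g. blurred, grayscale, cartoon); a CLIP-style sketch with placeholder embeddings is shown below, with the label set and scoring rule assumed for illustration.

```python
import torch
import torch.nn.functional as F

def zero_shot_data_type(image_emb, text_embs, labels):
    """Sketch of zero-shot visual data-type identification: score an
    image embedding against text embeddings of candidate type
    descriptions and pick the best match. Embeddings here are random
    placeholders standing in for a real encoder such as CLIP."""
    sims = F.cosine_similarity(image_emb.unsqueeze(0), text_embs, dim=-1)
    return labels[int(sims.argmax())]

labels = ["a blurred photo", "a grayscale photo", "a cartoon of an object"]
print(zero_shot_data_type(torch.randn(512), torch.randn(3, 512), labels))
```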
arXiv Detail & Related papers (2023-10-12T17:59:30Z) - VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, helping the model acquire fine-grained and unbiased features.
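The generative ingredient of a masked-autoencoder-style setup is patch masking; a minimal sketch of that step is below, with the mask ratio and tensor shapes as assumptions (the decoder and text branch are omitted).

```python
import torch

def random_patch_mask(patches, mask_ratio=0.75):
    """Sketch of the masking step in masked-autoencoder-style
    vision-language pre-training: hide most patches and train a decoder
    (not shown) to reconstruct them. patches: (B, N, d)."""
    B, N, d = patches.shape
    keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :keep]   # random subset to keep
    visible = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, idx

vis, idx = random_patch_mask(torch.randn(2, 196, 768))
print(vis.shape)  # torch.Size([2, 49, 768]) with the default 75% mask ratio
```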
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.