Supervision-free Vision-Language Alignment
- URL: http://arxiv.org/abs/2501.04568v1
- Date: Wed, 08 Jan 2025 15:32:12 GMT
- Title: Supervision-free Vision-Language Alignment
- Authors: Giorgio Giannone, Ruoteng Li, Qianli Feng, Evgeny Perevodchikov, Rui Chen, Aleix Martinez,
- Abstract summary: We introduce SVP (Supervision-free Visual Projection), a framework that enhances vision-language alignment without relying on curated data or preference annotation.<n>We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall.
- Score: 11.012355590697064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Supervision-free Visual Projection), a novel framework that enhances vision-language alignment without relying on curated data or preference annotation. SVP leverages self-captioning and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements, including a 14% average improvement in captioning tasks, up to 12% increase in object recall, and substantial reduction in hallucination rates. Notably, a small VLM using SVP achieves hallucination reductions comparable to a model five times larger, while a VLM with initially poor referring capabilities more than doubles its performance, approaching parity with a model twice its size.
Related papers
- Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy [53.07517728420411]
We introduce the first instruction database specifically focused on hallucinations in low-level vision tasks.
We propose the Self-Awareness Failure Elimination (SAFEQA) model to improve the perception and comprehension abilities of the model in low-level vision tasks.
We conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations.
arXiv Detail & Related papers (2025-03-26T16:05:01Z) - Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM)
Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model.
We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process finegrained details.
arXiv Detail & Related papers (2025-03-14T18:33:31Z) - Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images [7.823336661261962]
Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors.
We propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details.
arXiv Detail & Related papers (2025-02-19T18:05:42Z) - Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models [29.611769371733672]
We propose De Attention (D-Attn), a novel method that processes visual and textual embeddings differently.
D-Attn diagonalizes visual-to-visual self-attention, reducing computation from $mathcalO(|V|2)$ to $mathcalO(|V|)$ for $|V|$ visual embeddings without compromising performance.
arXiv Detail & Related papers (2025-02-04T00:46:11Z) - Probing Visual Language Priors in VLMs [51.016683265437536]
We introduce ViLP, a benchmark that pairs each question with three potential answers and three corresponding images.<n>By leveraging image generative models, we ensure significant variation in texture, shape, conceptual combinations, hallucinated elements, and proverb-based contexts.<n>We propose a self-improving framework in which models generate new VQA pairs and images, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training.
arXiv Detail & Related papers (2024-12-31T17:54:29Z) - FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability [10.184567639685321]
We introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding.
We present benchmarks to assess the model's ability to use image as substantive evidence.
We identify attention heads with the strongest vision-language alignment, enabling explainability on visual-driven hallucinations.
arXiv Detail & Related papers (2024-12-19T09:24:10Z) - Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
arXiv Detail & Related papers (2024-05-23T14:30:33Z) - Mitigating Hallucination in Visual Language Models with Visual
Supervision [33.05550629039951]
Large vision-language models (LVLMs) suffer from hallucination a lot.
Key problem lies in its weak ability to comprehend detailed content in a multi-modal context.
In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs.
arXiv Detail & Related papers (2023-11-27T09:30:02Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, 6.41%, and 7.94% points increase on A-OKVQA, and VizWiz respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.