FiVL: A Framework for Improved Vision-Language Alignment
- URL: http://arxiv.org/abs/2412.14672v1
- Date: Thu, 19 Dec 2024 09:24:10 GMT
- Title: FiVL: A Framework for Improved Vision-Language Alignment
- Authors: Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal,
- Abstract summary: We introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding.
These datasets can be utilized for both training and assessing an LVLM's ability to use image content as substantive evidence.
To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines alongside a validation method and application for explainability.
- Score: 10.184567639685321
- License:
- Abstract: Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective visual grounding in current LVLMs. This issue extends to vision-language benchmarks, where it is difficult to make the image indispensable for accurate answer generation, particularly in vision question-answering tasks. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and to evaluate their effectiveness in achieving it. These datasets can be utilized for both training and assessing an LVLM's ability to use image content as substantive evidence rather than relying solely on linguistic priors, providing insights into the model's reliance on visual information. To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines alongside a validation method and application for explainability. The code is available at https://github.com/IntelLabs/fivl.
Related papers
- Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models [127.38740043393527]
We propose ViFT, a visual instruction-free fine-tuning framework for LVLMs.
We only require the text-only instructions and image caption data during training, to separately learn the task-solving and visual perception abilities.
Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and visual instruction following benchmarks.
arXiv Detail & Related papers (2025-02-17T04:38:12Z) - Improving Fine-grained Visual Understanding in VLMs through Text-Only Training [0.0]
We investigate the feasibility of enhancing fine-grained visual understanding in Visual-Language Models (VLMs) through text-only training.
We conduct comprehensive experiments on two distinct domains: fine-grained species classification and cultural visual understanding tasks.
Our findings demonstrate that text-only training can be comparable to conventional image-text training while significantly reducing computational costs.
arXiv Detail & Related papers (2024-12-17T14:18:50Z) - Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - A-VL: Adaptive Attention for Large Vision-Language Models [10.027871150748956]
Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential.
Current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models.
We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns.
We develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference.
arXiv Detail & Related papers (2024-09-23T09:22:59Z) - Improving Visual Commonsense in Language Models via Multiple Image Generation [41.565399860320966]
Existing large language models (LLMs) are primarily trained using textual data only.
Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning.
This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning.
arXiv Detail & Related papers (2024-06-19T15:17:10Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are memoryd with the weights of FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs)
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z) - Learning without Forgetting for Vision-Language Models [86.53237963364754]
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world.
Recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations.
We propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting.
arXiv Detail & Related papers (2023-05-30T17:59:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.