Probing Visual Language Priors in VLMs
- URL: http://arxiv.org/abs/2501.00569v2
- Date: Sun, 16 Feb 2025 00:34:39 GMT
- Title: Probing Visual Language Priors in VLMs
- Authors: Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee,
- Abstract summary: Despite advances in Vision-Language Models, they may over-rely on visual language priors existing in their training data rather than true visual reasoning. We introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning.
- Score: 51.016683265437536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors existing in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. Our training objectives compel VLMs to focus more on the actual visual inputs, and we demonstrate their effectiveness in boosting the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.
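The corruption step in the abstract lends itself to a simple illustration. Below is a minimal sketch, not the authors' released code: the choice of Gaussian-noise corruption, the noise level, and all function names are assumptions, and the semantic corruption (e.g., an image edit that changes the answer) is only stubbed in a comment.

```python
# Sketch: forming a "good-bad" image pair by applying a pixel-level corruption
# to a generated VQA image. The semantic corruption would come from an
# image-editing model and is not implemented here.
import numpy as np
from PIL import Image


def pixel_corrupt(img: Image.Image, noise_std: float = 50.0, seed: int = 0) -> Image.Image:
    """Return a pixel-level corrupted copy of the image (additive Gaussian noise)."""
    rng = np.random.default_rng(seed)
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + rng.normal(0.0, noise_std, size=arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))


def make_good_bad_pair(image_path: str):
    """Pair the original ("good") image with its corrupted ("bad") twin."""
    good = Image.open(image_path).convert("RGB")
    bad = pixel_corrupt(good)
    # A semantic corruption (object swap, attribute edit, etc.) would be added
    # here via an image-editing model to produce further "bad" variants.
    return good, bad
```

A training objective can then contrast the model's behavior on the good and bad images, rewarding answers that depend on the actual pixels rather than on language priors alone.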
Related papers
- Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM).
Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model.
We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process fine-grained details.
arXiv Detail & Related papers (2025-03-14T18:33:31Z) - Supervision-free Vision-Language Alignment [11.012355590697064]
We introduce SVP (Supervision-free Visual Projection), a framework that enhances vision-language alignment without relying on curated data or preference annotation.
We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall.
arXiv Detail & Related papers (2025-01-08T15:32:12Z) - How Well Can Vision Language Models See Image Details? [53.036922527685064]
We introduce a pixel value prediction task to explore "How Well Can Vision Language Models See Image Details?"
Our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks.
arXiv Detail & Related papers (2024-08-07T17:59:40Z) - Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing discrepancies when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - Visually-augmented pretrained language models for NLP tasks without images [77.74849855049523]
Existing solutions often rely on explicit images for visual knowledge augmentation.
We propose a novel Visually-Augmented fine-tuning approach.
Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales.
arXiv Detail & Related papers (2022-12-15T16:13:25Z) - Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision [6.8582563015193]
Weakly-supervised vision-language pre-training aims at learning cross-modal alignment with little or no paired data.
Recent methods, which pair visual features with object tags, achieve performance comparable to that of some models trained with aligned pairs on various V-L downstream tasks.
We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH).
WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities.
arXiv Detail & Related papers (2022-10-24T20:30:55Z) - Probing Cross-modal Semantics Alignment Capability from the Textual Perspective [52.52870614418373]
Aligning cross-modal semantics is claimed to be one of the essential capabilities of vision and language pre-training models.
We propose a new probing method that is based on image captioning to first empirically study the cross-modal semantics alignment of vision and language pre-training models.
arXiv Detail & Related papers (2022-10-18T02:55:58Z) - VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, enabling the model to acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z) - VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
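As a rough illustration of the super-node construction described in the VQA-GNN entry above, here is a minimal sketch: the toy node names, edge attributes, and use of networkx are assumptions, and the actual model performs learned bidirectional message passing over such a graph rather than this static composition.

```python
# Sketch: joining a scene graph and a concept graph through a single
# "QA-context" super node, so information can flow between the two sources.
import networkx as nx

# Toy scene graph: objects detected in the image and their relations.
scene = nx.Graph()
scene.add_edge("person", "skateboard", relation="riding")

# Toy concept graph: background knowledge relevant to the question.
concept = nx.Graph()
concept.add_edge("skateboard", "sport", relation="used_for")

# Unified graph: compose both graphs, then link every node to a super node
# representing the QA context.
unified = nx.compose(scene, concept)
unified.add_node("QA_context", text="What is the person doing?")
for node in list(unified.nodes):
    if node != "QA_context":
        unified.add_edge("QA_context", node, relation="context")
```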