Intriguing Properties of Large Language and Vision Models
- URL: http://arxiv.org/abs/2410.04751v1
- Date: Mon, 7 Oct 2024 05:07:01 GMT
- Title: Intriguing Properties of Large Language and Vision Models
- Authors: Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi
- Abstract summary: Large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance.
Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks remains surprisingly low.
We investigate how LLVMs perceive images by evaluating the most common LLVM family (i.e., LLaVA) across 10 evaluation benchmarks.
- Score: 18.449076451976236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate this question regarding several aspects: permutation invariance, robustness, math reasoning, alignment preservation, and importance, by evaluating the most common LLVM family (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on the above observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.
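The permutation experiment mentioned in the abstract (randomly reordering the visual patch sequence) can be illustrated with a minimal sketch. This is not the authors' code: the function name `permute_patches`, the token count (576), and the embedding width (1024) are assumptions chosen only for illustration, and random arrays stand in for real vision-encoder outputs.

```python
import numpy as np

def permute_patches(patch_tokens, seed=0):
    """Shuffle patch tokens along the sequence axis.

    patch_tokens: array of shape (num_patches, dim).
    Returns the shuffled tokens and the permutation that was applied.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(patch_tokens.shape[0])
    return patch_tokens[perm], perm

# Hypothetical stand-in for vision-encoder output: 576 patches, 1024 dims.
tokens = np.random.default_rng(1).normal(size=(576, 1024))
shuffled, perm = permute_patches(tokens)

# The permutation is invertible, so the original order can be recovered.
restored = np.empty_like(shuffled)
restored[perm] = shuffled

# An order-independent summary (mean pooling) is unchanged by the shuffle.
same_pooled = np.allclose(tokens.mean(axis=0), shuffled.mean(axis=0))
```

In an actual evaluation, the shuffled tokens would be passed through the projector and LLM in place of the original sequence, and benchmark scores would be compared between the two orderings.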
Related papers
- Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks.
They often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison.
We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z)
- iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs [4.381263829108405]
Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment.
We introduce iVISPAR, an interactive multi-modal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents.
arXiv Detail & Related papers (2025-02-05T14:29:01Z)
- Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task [3.2228025627337864]
Advancing machine visual reasoning requires a deeper understanding of how Vision-Language Models (VLMs) process and interpret complex visual patterns.
This work introduces a novel, cognitively-inspired evaluation framework to systematically analyze VLM reasoning on natural image-based Bongard Problems.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
- Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding.
In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance.
We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z)
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [103.0226977561914]
We propose a comprehensive framework for advancing step-by-step visual reasoning in large language models.
First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks.
Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps.
Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach.
arXiv Detail & Related papers (2025-01-10T18:59:51Z)
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.
Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.
We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
- HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks [25.959032350818795]
We present HumanEval-V, a benchmark of human-annotated coding tasks.
Each task features carefully crafted diagrams paired with function signatures and test cases.
We find that even top-performing models achieve modest success rates.
arXiv Detail & Related papers (2024-10-16T09:04:57Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also to lack some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.