Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models
- URL: http://arxiv.org/abs/2502.15678v2
- Date: Fri, 30 May 2025 08:45:42 GMT
- Title: Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models
- Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Elif Akata, Matthias Bethge, Joshua B. Tenenbaum, Eric Schulz
- Abstract summary: We introduce visual stimuli and human judgments on visual cognition tasks to evaluate performance across cognitive domains. We fine-tune models on ground truth data for intuitive physics and causal reasoning. We find that task-specific fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics.
- Score: 51.58859621164201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains in a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that task-specific fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
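For concreteness, the protocol the abstract describes can be pictured as: fine-tune on ground-truth labels in one cognitive domain, then compare held-out in-domain accuracy against accuracy in a different domain. The sketch below is a minimal, hypothetical rendering of that protocol; the toy classifier, synthetic features, and binary task are illustrative stand-ins, not the authors' models or data.

```python
# Minimal sketch of the train-on-one-domain, test-across-domains protocol.
# Everything here (model, data, task) is an illustrative stand-in.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class ToyVLMHead(nn.Module):
    """Stand-in for a vision language model: maps image features to a
    binary judgment (e.g. 'will the block tower fall?')."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.classifier(feats)

def finetune(model, loader, epochs: int = 3, lr: float = 1e-4):
    """Supervised fine-tuning on ground-truth labels for one domain."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for feats, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(feats), labels)
            loss.backward()
            opt.step()

@torch.no_grad()
def accuracy(model, loader) -> float:
    model.eval()
    correct = total = 0
    for feats, labels in loader:
        preds = model(feats).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

def make_loader(n=256, feat_dim=512, seed=0):
    """Synthetic stand-in for a visual cognition dataset."""
    g = torch.Generator().manual_seed(seed)
    feats = torch.randn(n, feat_dim, generator=g)
    labels = torch.randint(0, 2, (n,), generator=g)
    return DataLoader(TensorDataset(feats, labels), batch_size=32)

# Fine-tune on one domain, then probe in-domain vs. cross-domain accuracy.
model = ToyVLMHead()
physics_train = make_loader(seed=0)   # fine-tuning domain
physics_test = make_loader(seed=1)    # same domain, held out
causal_test = make_loader(seed=2)     # different cognitive domain
finetune(model, physics_train)
print("in-domain accuracy:   ", accuracy(model, physics_test))
print("cross-domain accuracy:", accuracy(model, causal_test))
```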
Related papers
- Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning [18.13538667261998]
Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. We build a visual system that mimics human camouflaged perception to progressively and iteratively 'refocus' visually concealed content.
arXiv Detail & Related papers (2025-05-26T07:27:18Z)
- CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models [18.891323067948285]
It is unclear to what degree vision-language models emulate human behavior on tasks that involve reasoning about data visualizations. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans. We found that these models performed worse than human participants on average.
arXiv Detail & Related papers (2025-05-22T18:15:04Z)
- Emergent Active Perception and Dexterity of Simulated Humanoids from Visual Reinforcement Learning [69.71072181304066]
We introduce Perceptive Dexterous Control (PDC), a framework for vision-driven whole-body control with simulated humanoids. PDC operates solely on egocentric vision for task specification, enabling object search, target placement, and skill selection through visual cues. We show that training from scratch with reinforcement learning can produce emergent behaviors such as active search.
arXiv Detail & Related papers (2025-05-18T07:33:31Z)
- Can Language Models Learn to Skip Steps? [59.84848399905409]
We study the ability to skip steps in reasoning.
Unlike humans, who may skip steps to enhance efficiency or to reduce cognitive load, models do not possess such motivations.
Our work presents the first exploration into human-like step-skipping ability.
arXiv Detail & Related papers (2024-11-04T07:10:24Z)
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
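Work in this area often operationalizes "alignment with perceptual judgments" as fitting embedding distances to human two-alternative forced-choice (2AFC) similarity votes. The snippet below is a minimal sketch of one such loss under that assumption; the random tensors stand in for backbone embeddings of a reference image and two distortions.

```python
# Sketch of a common alignment recipe: fit embedding distances to
# human 2AFC similarity judgments. Data here are random stand-ins.
import torch
import torch.nn.functional as F

def twoafc_loss(emb_ref, emb_a, emb_b, human_choice):
    """human_choice[i] = 1 if humans judged image B closer to the
    reference than image A, else 0. The distance gap serves as a
    logit for a logistic loss."""
    d_a = 1 - F.cosine_similarity(emb_ref, emb_a)
    d_b = 1 - F.cosine_similarity(emb_ref, emb_b)
    logits = d_a - d_b  # positive => A looks farther => predict choice 1
    return F.binary_cross_entropy_with_logits(logits, human_choice.float())

# Toy usage with random "embeddings" standing in for backbone features.
emb_ref, emb_a, emb_b = (torch.randn(8, 128) for _ in range(3))
human_choice = torch.randint(0, 2, (8,))
print(twoafc_loss(emb_ref, emb_a, emb_b, human_choice))
```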
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms [91.19304518033144]
We aim to align vision models with human aesthetic standards in a retrieval system.
We propose a preference-based reinforcement learning method that fine-tunes vision models to better align them with human aesthetics.
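Preference-based fine-tuning methods typically build on a Bradley-Terry-style objective over human-ranked pairs. The sketch below shows that generic objective, not the paper's exact algorithm; random numbers stand in for a retrieval model's query-image scores.

```python
# Hedged sketch of a Bradley-Terry preference objective, a common core
# of preference-based fine-tuning; scores here are random stand-ins.
import torch
import torch.nn.functional as F

def preference_loss(score_preferred, score_rejected):
    """Negative log-likelihood that pushes the score of the
    human-preferred image above the rejected one."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage: in practice, scores would come from model(query, image).
score_win = torch.randn(16, requires_grad=True)
score_lose = torch.randn(16)
loss = preference_loss(score_win, score_lose)
loss.backward()
print(loss.item())
```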
arXiv Detail & Related papers (2024-06-13T17:59:20Z)
- Dual Thinking and Logical Processing -- Are Multi-modal Large Language Models Closing the Gap with Human Vision? [5.076961098583674]
The perception of dual thinking in vision requires images where inferences from intuitive and logical processing differ. We introduce an adversarial dataset to provide evidence for the dual thinking framework in human vision.
arXiv Detail & Related papers (2024-06-11T05:50:34Z)
- UniAR: A Unified model for predicting human Attention and Responses on visual content [12.281060227170792]
We propose UniAR -- a unified model of human attention and preference behavior across diverse visual content.
We train UniAR on diverse public datasets spanning natural images, webpages, and graphic designs, and achieve SOTA performance on multiple benchmarks.
Potential applications include providing instant feedback on the effectiveness of UIs/visual content, and enabling designers and content-creation models to optimize their creations for human-centric improvements.
arXiv Detail & Related papers (2023-12-15T19:57:07Z)
- Visual cognition in multimodal large language models [12.603212933816206]
Recent advances in multimodal large language models have rekindled interest in their potential to emulate human-like cognitive abilities.
This paper evaluates the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology.
arXiv Detail & Related papers (2023-11-27T18:58:34Z)
- Turning large language models into cognitive models [0.0]
We show that large language models can be turned into cognitive models.
These models offer accurate representations of human behavior, even outperforming traditional cognitive models in two decision-making domains.
Taken together, these results suggest that large, pre-trained models can be adapted to become generalist cognitive models.
arXiv Detail & Related papers (2023-06-06T18:00:01Z)
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
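As a rough illustration of that data-construction step, the snippet below verbalizes feedback of both polarities into a single training sequence that the model can then be fine-tuned on. The template wording is invented for illustration and is not the paper's exact prompt.

```python
# Sketch of the feedback-to-sentences step described above: feedback of
# either polarity is verbalized and packed into one training sequence,
# so the model learns what 'good' vs 'bad' looks like in context.
def chain_of_hindsight_example(prompt, good_answer, bad_answer):
    """Return one fine-tuning sequence pairing positive and negative
    feedback with the corresponding answers (illustrative template)."""
    return (
        f"{prompt}\n"
        f"A good response: {good_answer}\n"
        f"A bad response: {bad_answer}"
    )

print(chain_of_hindsight_example(
    prompt="Summarize: The cat sat on the mat.",
    good_answer="A cat rested on a mat.",
    bad_answer="Dogs enjoy long walks.",
))
```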
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Human Eyes Inspired Recurrent Neural Networks are More Robust Against Adversarial Noises [7.689542442882423]
We designed a dual-stream vision model inspired by the human brain.
This model features retina-like input layers and includes two streams: one that determines the next point of focus (the fixation), and another that interprets the visuals surrounding it.
We evaluated this model against various benchmarks in terms of object recognition, gaze behavior and adversarial robustness.
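A minimal sketch of that dual-stream design, assuming a recurrent glimpse architecture: a "where" head proposes the next fixation while a "what" head interprets the glimpse around it. All sizes, the crop logic, and the readout below are illustrative stand-ins for the paper's actual model.

```python
# Illustrative dual-stream glimpse network: a 'where' stream moves the
# fixation and a 'what' stream classifies what is seen around it.
import torch
import torch.nn as nn

class DualStreamGlimpseNet(nn.Module):
    def __init__(self, glimpse: int = 16, hidden: int = 128, classes: int = 10):
        super().__init__()
        self.glimpse, self.hidden = glimpse, hidden
        # Retina-like encoder for the small single-channel patch
        # around the current fixation.
        self.encode = nn.Sequential(
            nn.Flatten(), nn.Linear(glimpse * glimpse, hidden), nn.ReLU())
        self.core = nn.GRUCell(hidden, hidden)   # shared recurrent state
        self.where = nn.Linear(hidden, 2)        # stream 1: next fixation (x, y)
        self.what = nn.Linear(hidden, classes)   # stream 2: interpret the view

    def crop(self, images, fix):
        """Cut a glimpse around each fixation; fix coords lie in [-1, 1]."""
        _, _, H, W = images.shape
        out = []
        for img, (fx, fy) in zip(images, fix):
            cx = int((fx.item() + 1) / 2 * (W - self.glimpse))
            cy = int((fy.item() + 1) / 2 * (H - self.glimpse))
            out.append(img[:, cy:cy + self.glimpse, cx:cx + self.glimpse])
        return torch.stack(out)

    def forward(self, images, steps: int = 4):
        h = images.new_zeros(images.size(0), self.hidden)
        fix = images.new_zeros(images.size(0), 2)   # start at image center
        for _ in range(steps):
            h = self.core(self.encode(self.crop(images, fix)), h)
            fix = torch.tanh(self.where(h))         # 'where' stream moves the eye
        return self.what(h)                         # 'what' stream reads out

# Toy usage on random single-channel images.
net = DualStreamGlimpseNet()
print(net(torch.randn(4, 1, 64, 64)).shape)  # torch.Size([4, 10])
```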
arXiv Detail & Related papers (2022-06-15T03:44:42Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations than visual-only ones.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a visual-only state-of-the-art method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- Visual Grounding of Learned Physical Models [66.04898704928517]
Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions.
We present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors.
Experiments show that our model can infer the physical properties within a few observations, which allows it to quickly adapt to unseen scenarios and make accurate predictions about the future.
arXiv Detail & Related papers (2020-04-28T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.