Testing the limits of fine-tuning to improve reasoning in vision language models
- URL: http://arxiv.org/abs/2502.15678v1
- Date: Fri, 21 Feb 2025 18:58:30 GMT
- Title: Testing the limits of fine-tuning to improve reasoning in vision language models
- Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Elif Akata, Matthias Bethge, Joshua B. Tenenbaum, Eric Schulz
- Abstract summary: We introduce visual stimuli and human judgments on visual cognition tasks to evaluate performance across cognitive domains. We fine-tune models on ground truth data for intuitive physics and causal reasoning. We find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
- Score: 51.58859621164201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains in a consistent setting. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
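The recipe the abstract describes, fine-tuning on ground-truth data from one cognitive domain and then checking both in-domain performance and transfer to another domain, can be sketched as follows. This is a hedged illustration, not the authors' code: the vision language model is replaced by a stand-in linear head over dummy image features so the sketch runs end to end, and names such as `physics_train` and `causal_test` are hypothetical.

```python
# Minimal sketch of the fine-tune-then-probe-generalization setup (assumptions noted above).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

def fake_domain(n=256, dim=512, num_classes=2):
    """Stand-in for (image feature, ground-truth label) pairs from one cognitive domain."""
    return TensorDataset(torch.randn(n, dim), torch.randint(0, num_classes, (n,)))

physics_train, physics_test = fake_domain(), fake_domain(n=64)   # fine-tuning domain
causal_test = fake_domain(n=64)                                  # transfer domain

head = nn.Linear(512, 2)                      # stand-in for the trainable part of the VLM
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# (1) fine-tune on ground-truth data from the intuitive-physics domain
for epoch in range(5):
    for feats, labels in DataLoader(physics_train, batch_size=32, shuffle=True):
        opt.zero_grad()
        loss_fn(head(feats), labels).backward()
        opt.step()

# (2) evaluate in-domain and on the held-out domain to probe generalization
@torch.no_grad()
def accuracy(dataset):
    feats, labels = dataset.tensors
    return (head(feats).argmax(dim=-1) == labels).float().mean().item()

print(f"in-domain (physics) accuracy: {accuracy(physics_test):.2f}")
print(f"transfer (causal) accuracy:   {accuracy(causal_test):.2f}")
```

In the paper's setting the gap between the two evaluations is the quantity of interest: in-domain accuracy improves with fine-tuning, while transfer to other visual characteristics or cognitive domains does not.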
Related papers
- Can Language Models Learn to Skip Steps? [59.84848399905409]
We study the ability to skip steps in reasoning.
Unlike humans, who may skip steps to enhance efficiency or to reduce cognitive load, models do not possess such motivations.
Our work presents the first exploration into human-like step-skipping ability.
arXiv Detail & Related papers (2024-11-04T07:10:24Z) - When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms [91.19304518033144]
We aim to align vision models with human aesthetic standards in a retrieval system.
We propose a preference-based reinforcement learning method that fine-tunes vision models to better align them with human aesthetics.
arXiv Detail & Related papers (2024-06-13T17:59:20Z) - Dual Thinking and Logical Processing -- Are Multi-modal Large Language Models Closing the Gap with Human Vision ? [5.076961098583674]
The perception of dual thinking in vision requires images where inferences from intuitive and logical processing differ. We introduce an adversarial dataset to provide evidence for the dual thinking framework in human vision.
arXiv Detail & Related papers (2024-06-11T05:50:34Z) - UniAR: A Unified model for predicting human Attention and Responses on visual content [12.281060227170792]
We propose UniAR -- a unified model of human attention and preference behavior across diverse visual content.
We train UniAR on diverse public datasets spanning natural images, webpages, and graphic designs, and achieve SOTA performance on multiple benchmarks.
Potential applications include providing instant feedback on the effectiveness of UIs/visual content, and enabling designers and content-creation models to optimize their creations for human-centric improvements.
arXiv Detail & Related papers (2023-12-15T19:57:07Z) - Visual cognition in multimodal large language models [12.603212933816206]
Recent advances in multimodal large language models have rekindled interest in their potential to emulate human-like cognitive abilities.
This paper evaluates the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology.
arXiv Detail & Related papers (2023-11-27T18:58:34Z) - Turning large language models into cognitive models [0.0]
We show that large language models can be turned into cognitive models.
These models offer accurate representations of human behavior, even outperforming traditional cognitive models in two decision-making domains.
Taken together, these results suggest that large, pre-trained models can be adapted to become generalist cognitive models.
arXiv Detail & Related papers (2023-06-06T18:00:01Z) - Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model (see the formatting sketch after this list).
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Human Eyes Inspired Recurrent Neural Networks are More Robust Against Adversarial Noises [7.689542442882423]
We designed a dual-stream vision model inspired by the human brain.
This model features retina-like input layers and includes two streams: one that determines the next point of focus (the fixation), and another that interprets the visual content surrounding the fixation.
We evaluated this model against various benchmarks in terms of object recognition, gaze behavior and adversarial robustness.
arXiv Detail & Related papers (2022-06-15T03:44:42Z) - Visual Grounding of Learned Physical Models [66.04898704928517]
Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions.
We present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors.
Experiments show that our model can infer the physical properties within a few observations, which allows the model to quickly adapt to unseen scenarios and make accurate predictions into the future.
arXiv Detail & Related papers (2020-04-28T17:06:38Z)
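For the Chain of Hindsight entry above, the core data transformation is to rewrite a prompt together with a preferred and a dispreferred response as one natural-language training sequence, so the model can learn from feedback of either polarity. The sketch below is a hedged illustration under that reading; the template wording and the example strings are assumptions, not the paper's exact format.

```python
# Hedged sketch of the Chain of Hindsight data formatting (not the authors' code):
# both polarities of feedback are expressed in one language-modeling target.
def chain_of_hindsight_example(prompt: str, good: str, bad: str) -> str:
    """Turn a (prompt, preferred response, dispreferred response) triple into a single
    fine-tuning sequence that states which response is good and which is bad."""
    return (
        f"{prompt}\n"
        f"A good response is: {good}\n"
        f"A bad response is: {bad}"
    )

example = chain_of_hindsight_example(
    prompt="Summarize the article in one sentence.",
    good="The study finds fine-tuning improves in-domain reasoning but not transfer.",
    bad="The article is about many things.",
)
print(example)  # this string would then be used as an ordinary fine-tuning target
```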