Dual Thinking and Logical Processing -- Are Multi-modal Large Language Models Closing the Gap with Human Vision ?
- URL: http://arxiv.org/abs/2406.06967v2
- Date: Thu, 30 Jan 2025 14:37:55 GMT
- Title: Dual Thinking and Logical Processing -- Are Multi-modal Large Language Models Closing the Gap with Human Vision ?
- Authors: Kailas Dayanandan, Nikhil Kumar, Anand Sinha, Brejesh Lall,
- Abstract summary: The perception of dual thinking in vision requires images where inferences from intuitive and logical processing differ.<n>We introduce an adversarial dataset to provide evidence for the dual thinking framework in human vision.
- Score: 5.076961098583674
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The dual thinking framework considers fast, intuitive processing and slower, logical processing. The perception of dual thinking in vision requires images where inferences from intuitive and logical processing differ. We introduce an adversarial dataset to provide evidence for the dual thinking framework in human vision, which also aids in studying the qualitative behavior of deep learning models. The evidence underscores the importance of shape in identifying instances in human vision. Our psychophysical studies show the presence of multiple inferences in rapid succession, and analysis of errors shows the early stopping of visual processing can result in missing relevant information. Our study shows that segmentation models lack an understanding of sub-structures, as indicated by errors related to the position and number of sub-components. Additionally, the similarity in errors made by models and intuitive human processing indicates that models only address intuitive thinking in human vision. In contrast, multi-modal LLMs, including open-source models, demonstrate tremendous progress on errors made in intuitive processing. The models have improved performance on images that require logical reasoning and show recognition of sub-components. However, they have not matched the performance improvements made on errors in intuitive processing.
Related papers
- VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search [89.43196232124883]
VisuoThink is a novel framework that seamlessly integrates visuospatial and linguistic domains.
It enables progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search.
arXiv Detail & Related papers (2025-04-12T08:37:30Z) - A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems [93.8285345915925]
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making.
With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems.
We categorize existing methods along two dimensions: (1) Regimes, which define the stage at which reasoning is achieved; and (2) Architectures, which determine the components involved in the reasoning process.
arXiv Detail & Related papers (2025-04-12T01:27:49Z) - Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks [42.022527376404476]
Embodied Reasoner is a model that extends o1 style reasoning to interactive embodied search tasks.
We synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes.
We develop a three-stage training pipeline that progressively enhances the model's capabilities.
arXiv Detail & Related papers (2025-03-27T17:00:51Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework using Bongard Problems (BPs) to dissect the perception-reasoning interface in Vision-Language Models (VLMs)
We propose three distinct evaluation paradigms, mirroring human problem-solving strategies.
Our framework provides a valuable diagnostic tool, highlighting the need to enhance visual processing fidelity for achieving more robust and human-like visual intelligence in AI.
arXiv Detail & Related papers (2025-01-23T12:42:42Z) - When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - Intriguing Properties of Large Language and Vision Models [18.449076451976236]
Large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance.
Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks remains surprisingly low.
We investigate this question by evaluating the most common LLVM's families (i.e., LLaVA) across 10 evaluation benchmarks.
arXiv Detail & Related papers (2024-10-07T05:07:01Z) - CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks [39.43278448546028]
Kahneman's dual-system theory elucidates the human decision-making process, distinguishing between the rapid, intuitive System 1 and the deliberative, rational System 2.
Recent advancements have positioned large language Models (LLMs) as formidable tools nearing human-level proficiency in various cognitive tasks.
This study introduces the textbfCogniDual Framework for LLMs (CFLLMs), designed to assess whether LLMs can, through self-training, evolve from deliberate deduction to intuitive responses.
arXiv Detail & Related papers (2024-09-05T09:33:24Z) - Graphical Perception of Saliency-based Model Explanations [6.936466872687605]
We study the perception of model explanations, specifically, saliency-based explanations for visual recognition models.
Our findings show that factors related to visualization design decisions, the type of alignment, and qualities of the saliency map all play important roles in how humans perceive saliency-based visual explanations.
arXiv Detail & Related papers (2024-06-11T20:29:25Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Learning Interpretable Concepts: Unifying Causal Representation Learning
and Foundation Models [51.43538150982291]
We study how to learn human-interpretable concepts from data.
Weaving together ideas from both fields, we show that concepts can be provably recovered from diverse data.
arXiv Detail & Related papers (2024-02-14T15:23:59Z) - CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence.
After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) actively without involving external tools.
Our trained model, textbfCogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z) - Localization vs. Semantics: Visual Representations in Unimodal and
Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Zero-shot visual reasoning through probabilistic analogical mapping [2.049767929976436]
We present visiPAM (visual Probabilistic Analogical Mapping), a model of visual reasoning that synthesizes two approaches.
We show that without any direct training, visiPAM outperforms a state-of-the-art deep learning model on an analogical mapping task.
In addition, visiPAM closely matches the pattern of human performance on a novel task involving mapping of 3D objects across disparate categories.
arXiv Detail & Related papers (2022-09-29T20:29:26Z) - Human Eyes Inspired Recurrent Neural Networks are More Robust Against Adversarial Noises [7.689542442882423]
We designed a dual-stream vision model inspired by the human brain.
This model features retina-like input layers and includes two streams: one determining the next point of focus (the fixation), while the other interprets the visuals surrounding the fixation.
We evaluated this model against various benchmarks in terms of object recognition, gaze behavior and adversarial robustness.
arXiv Detail & Related papers (2022-06-15T03:44:42Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - Learnable Visual Words for Interpretable Image Recognition [70.85686267987744]
We propose the Learnable Visual Words (LVW) to interpret the model prediction behaviors with two novel modules.
The semantic visual words learning relaxes the category-specific constraint, enabling the general visual words shared across different categories.
Our experiments on six visual benchmarks demonstrate the superior effectiveness of our proposed LVW in both accuracy and model interpretation.
arXiv Detail & Related papers (2022-05-22T03:24:45Z) - Causal Reasoning Meets Visual Representation Learning: A Prospective
Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are becoming the challenges of the existing visual models.
Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z) - Exploring Alignment of Representations with Human Perception [47.53970721813083]
We show that inputs that are mapped to similar representations by the model should be perceived similarly by humans.
Our approach yields a measure of the extent to which a model is aligned with human perception.
We find that various properties of a model like its architecture, training paradigm, training loss, and data augmentation play a significant role in learning representations that are aligned with human perception.
arXiv Detail & Related papers (2021-11-29T17:26:50Z) - Human-Understandable Decision Making for Visual Recognition [30.30163407674527]
We propose a new framework to train a deep neural network by incorporating the prior of human perception into the model learning process.
The effectiveness of our proposed model is evaluated on two classical visual recognition tasks.
arXiv Detail & Related papers (2021-03-05T02:07:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.