Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
- URL: http://arxiv.org/abs/2505.11141v2
- Date: Sat, 24 May 2025 02:18:52 GMT
- Title: Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
- Authors: Yansheng Qiu, Li Xiao, Zhaopan Xu, Pengfei Zhou, Zheng Wang, Kaipeng Zhang
- Abstract summary: We propose Human-Aligned Bench, a benchmark for alignment of multimodal reasoning with human performance. We collected 9,794 multimodal questions that solely rely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions. Extensive experiments on the Human-Aligned Bench reveal notable differences between the performance of current MLLMs in multimodal reasoning and human performance.
- Score: 9.315735862658244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of achieving Artificial General Intelligence (AGI) is to imitate humans and surpass them. Models such as OpenAI's o1, o3, and DeepSeek's R1 have demonstrated that large language models (LLMs) with human-like reasoning capabilities exhibit exceptional performance and are being gradually integrated into multimodal large language models (MLLMs). However, whether these models possess capabilities comparable to humans in handling reasoning tasks remains unclear at present. In this paper, we propose Human-Aligned Bench, a benchmark for fine-grained alignment of multimodal reasoning with human performance. Specifically, we collected 9,794 multimodal questions that solely rely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions, encompassing four question types: visual reasoning, definition judgment, analogical reasoning, and logical judgment. More importantly, each question is accompanied by human success rates and options that humans are prone to choosing incorrectly. Extensive experiments on the Human-Aligned Bench reveal notable differences between the performance of current MLLMs in multimodal reasoning and human performance. The findings on our benchmark provide insights into the development of the next-generation models.
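The abstract's central idea is that each question carries two human statistics: a success rate and the distractor humans most often pick. A simple per-question analysis can then measure how closely a model's behavior tracks human behavior. The Python sketch below is illustrative only, under assumed data: the fields `human_success_rate` and `human_common_wrong_option`, and the `alignment_report` helper, are hypothetical names, not the benchmark's actual schema or evaluation protocol.

```python
# Minimal sketch, assuming a hypothetical per-question record; this is NOT the
# official Human-Aligned Bench format or evaluation code.
from dataclasses import dataclass
from statistics import correlation  # Pearson correlation, Python 3.10+

@dataclass
class Question:
    qid: str
    correct_option: str             # ground-truth answer key, e.g. "B"
    human_success_rate: float       # fraction of humans answering correctly
    human_common_wrong_option: str  # distractor humans most often choose

def alignment_report(questions: list[Question], model_answers: dict[str, str]) -> dict:
    """Compare a model's answers against human performance statistics.

    model_answers maps question id -> the option the model selected.
    """
    model_correct = [1.0 if model_answers[q.qid] == q.correct_option else 0.0
                     for q in questions]
    human_rates = [q.human_success_rate for q in questions]

    # Does the model find hard what humans find hard?
    # (correlation() raises StatisticsError if either sequence is constant,
    # e.g. when the model answers every question correctly.)
    difficulty_alignment = correlation(model_correct, human_rates)

    # When the model errs, does it pick the same distractor humans tend to pick?
    wrong = [q for q in questions if model_answers[q.qid] != q.correct_option]
    shared_error_rate = (
        sum(model_answers[q.qid] == q.human_common_wrong_option for q in wrong) / len(wrong)
        if wrong else 0.0
    )

    return {
        "model_accuracy": sum(model_correct) / len(questions),
        "mean_human_success_rate": sum(human_rates) / len(questions),
        "difficulty_alignment": difficulty_alignment,
        "shared_error_rate": shared_error_rate,
    }
```

In this kind of analysis, a high difficulty correlation and a high shared-error rate would indicate human-aligned reasoning behavior, while high accuracy with low alignment would suggest the model succeeds and fails on different questions than humans do.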
Related papers
- Pixels, Patterns, but No Poetry: To See The World like Humans [33.773551676022514]
State-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks that are trivial for humans. This paper shifts the focus from reasoning to perception.
arXiv Detail & Related papers (2025-07-21T21:50:16Z)
- MANBench: Is Your Multimodal Model Smarter than Human? [7.483339020254684]
We introduce MANBench, a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks. We compared human performance against state-of-the-art Multimodal Large Language Models (MLLMs). Results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks. We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities.
arXiv Detail & Related papers (2025-06-04T08:42:14Z)
- Empirically evaluating commonsense intelligence in large language models with large-scale human judgments [4.7206754497888035]
We propose a novel method for evaluating common sense in artificial intelligence. We measure the correspondence between a model's judgment and that of a human population. Our framework contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
arXiv Detail & Related papers (2025-05-15T13:55:27Z)
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These question types allow the visual reasoning capabilities of MLLMs to be assessed from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
- Giving AI Personalities Leads to More Human-Like Reasoning [7.124736158080938]
We investigate the potential of AI to mimic diverse reasoning behaviors across a human population. We designed reasoning tasks using a novel generalization of the Natural Language Inference (NLI) format. We used personality-based prompting inspired by the Big Five personality model to elicit AI responses reflecting specific personality traits.
arXiv Detail & Related papers (2025-02-19T23:51:23Z)
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z)
- Humanlike Cognitive Patterns as Emergent Phenomena in Large Language Models [2.9312156642007294]
We systematically review Large Language Models' capabilities across three important cognitive domains: decision-making biases, reasoning, and creativity. On decision-making, our synthesis reveals that while LLMs demonstrate several human-like biases, some biases observed in humans are absent. On reasoning, advanced LLMs like GPT-4 exhibit deliberative reasoning akin to human System-2 thinking, while smaller models fall short of human-level performance. A distinct dichotomy emerges in creativity: while LLMs excel in language-based creative tasks, such as storytelling, they struggle with divergent thinking tasks that require real-world context.
arXiv Detail & Related papers (2024-12-20T02:26:56Z)
- MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis [53.012111671763776]
This study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions.
Results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones.
Although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy.
arXiv Detail & Related papers (2024-11-18T02:09:48Z)
- Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning [53.45295657891099]
This paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework.
It simulates human multi-modal, multi-turn reasoning, providing instance-level experience for highly intelligent models.
Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity.
arXiv Detail & Related papers (2024-10-04T11:18:41Z)
- MMToM-QA: Multimodal Theory of Mind Question Answering [80.87550820953236]
Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence.
Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding.
Human ToM, on the other hand, is more than video or text understanding.
People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
arXiv Detail & Related papers (2024-01-16T18:59:24Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset, and the experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)