Related papers: HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

URL: http://arxiv.org/abs/2508.10576v2
Date: Wed, 27 Aug 2025 10:04:02 GMT
Title: HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs
Authors: Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Yi Yuan, Jingdong Chen, Le Wang,
Abstract summary: HumanSense is a benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs.<n>Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks.<n>We employ a multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model.
Score: 46.59239283399911
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ a multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. Project page: \textcolor{brightpink}https://digital-avatar.github.io/ai/HumanSense/

Related papers

Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models [118.44328586173556]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks.<n>Human-MME is a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding.<n>Our benchmark extends the single-target understanding to the multi-person and multi-image mutual understanding.
arXiv Detail & Related papers (2025-09-30T12:20:57Z)
Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach [29.502292089901825]
We argue that this inconsistency stems partly from constraints in existing evaluation methods.<n>We propose an Emotion Statement Judgment task that overcomes these constraints.<n>We devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort.
arXiv Detail & Related papers (2025-09-26T06:30:39Z)
HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes [72.26829188852139]
HumanPCR is an evaluation suite for probing MLLMs' capacity about human-related visual contexts.<n>Human-P, HumanThought-C, and Human-R feature over 6,000 human-verified multiple choice questions.<n>Human-R offers a challenging manually curated video reasoning test.
arXiv Detail & Related papers (2025-08-19T09:52:04Z)
Pixels, Patterns, but No Poetry: To See The World like Humans [33.773551676022514]
State-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks trivial for humans.<n>This paper shifts focus from reasoning to perception.
arXiv Detail & Related papers (2025-07-21T21:50:16Z)
MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind [41.188841829937466]
MoMentS (Multimodal Mental States) is a benchmark for building socially intelligent multimodal agents.<n>MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories.<n>We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively.
arXiv Detail & Related papers (2025-07-06T15:06:30Z)
Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans [9.315735862658244]
We propose Human-Aligned Bench, a benchmark for alignment of multimodal reasoning with human performance.<n>We collected 9,794 multimodal questions that solely rely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions.<n>Extensive experiments on the Human-Aligned Bench reveal notable differences between the performance of current MLLMs in multimodal reasoning and human performance.
arXiv Detail & Related papers (2025-05-16T11:41:19Z)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories.<n>These types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives.<n>Most models score below 30% accuracy-only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
Explainable Recommendation with Simulated Human Feedback [8.532115411106068]
We propose a novel human-like feedback-driven optimization framework for explainable recommendations.<n>This framework employs a dynamic interactive optimization mechanism for achieving human-centered explainable requirements without incurring high labor costs.<n>In particular, we propose to utilize large language models (LLMs) as human simulators to predict human-like feedback for guiding the learning process.
arXiv Detail & Related papers (2025-04-19T02:46:10Z)
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs.<n>Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes.<n>We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.<n>We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.<n>We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs.<n>HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects.<n>A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z)
VAGUE: Visual Contexts Clarify Ambiguous Expressions [15.140825578254324]
We introduce VAGUE, a benchmark evaluating multimodal AI systems' ability to integrate visual context for intent.<n>VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations.<n>Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent.
arXiv Detail & Related papers (2024-11-21T14:01:42Z)
PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, a framework for better data construction and model tuning.<n>For insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction.<n>For rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models' personalities.
arXiv Detail & Related papers (2024-07-17T08:13:22Z)
You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
The research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation. Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.