HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
- URL: http://arxiv.org/abs/2508.13692v1
- Date: Tue, 19 Aug 2025 09:52:04 GMT
- Title: HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
- Authors: Keliang Li, Hongze Shen, Hao Shi, Ruibing Hou, Hong Chang, Jie Huang, Chenghao Jia, Wen Wang, Yiling Wu, Dongmei Jiang, Shiguang Shan, Xilin Chen
- Abstract summary: HumanPCR is an evaluation suite for probing MLLMs' capacity on human-related visual contexts. Human-P and Human-C feature over 6,000 human-verified multiple-choice questions. Human-R offers a challenging, manually curated video reasoning test.
- Score: 72.26829188852139
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity on human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple-choice questions assessing a broad range of tasks across 9 dimensions, including essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging, manually curated video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations of over 30 state-of-the-art models reveal significant challenges in human-centric visual understanding, particularly in tasks involving detailed space perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R reveals that models struggle to extract essential proactive visual evidence from diverse human scenes and rely, often faultily, on query-guided retrieval. Even advanced techniques such as scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.
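A minimal sketch of how one might score an MLLM on the Human-P / Human-C multiple-choice levels described above, reporting accuracy per level and per dimension. The JSON schema, field names, and the `model.answer` interface are illustrative assumptions, not the released HumanPCR format or API.

```python
# Hypothetical scoring loop for a hierarchical multiple-choice suite such as
# Human-P / Human-C. File format, field names, and the model.answer() call
# are assumptions for illustration only.
import json
from collections import defaultdict

def evaluate(model, questions_path):
    """Return accuracy per level and per (level, dimension) pair."""
    with open(questions_path) as f:
        items = json.load(f)  # assumed: [{"level": "Human-P", "dimension": "space perception",
                              #            "media": "clip_001.mp4", "question": "...",
                              #            "choices": ["A ...", "B ...", "C ...", "D ..."],
                              #            "answer": "A"}, ...]
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = model.answer(item["media"], item["question"], item["choices"])
        for key in (item["level"], (item["level"], item["dimension"])):
            total[key] += 1
            correct[key] += int(pred == item["answer"])
    return {key: correct[key] / total[key] for key in total}
```

Breaking accuracy out per dimension makes the failure modes highlighted in the abstract (space perception, temporal understanding, mind modeling) directly visible in the results.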
Related papers
- Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models [118.44328586173556]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. Human-MME is a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. The benchmark extends single-target understanding to multi-person and multi-image mutual understanding.
arXiv Detail & Related papers (2025-09-30T12:20:57Z) - HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs [46.59239283399911]
HumanSense is a benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs. The evaluation reveals that leading MLLMs still have considerable room for improvement, particularly on advanced interaction-oriented tasks. A multi-stage, modality-progressive reinforcement learning scheme is employed to enhance the reasoning abilities of an Omni model.
arXiv Detail & Related papers (2025-08-14T12:14:15Z) - HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z) - Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans [9.315735862658244]
We propose Human-Aligned Bench, a benchmark for aligning multimodal reasoning in MLLMs with human performance. We collected 9,794 multimodal questions that rely solely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions. Extensive experiments on Human-Aligned Bench reveal notable differences between the multimodal reasoning performance of current MLLMs and that of humans.
arXiv Detail & Related papers (2025-05-16T11:41:19Z) - HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding [57.763735969891286]
We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. We present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum.
arXiv Detail & Related papers (2025-03-17T09:10:50Z) - HumanVLM: Foundation for Human-Scene Vision-Language Model [3.583459930633303]
Human-scene vision-language tasks are increasingly prevalent in diverse social applications.
This study introduces a domain-specific large vision-language model, the Human-Scene Vision-Language Model (HumanVLM).
In experiments, we evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance.
arXiv Detail & Related papers (2024-11-05T12:14:57Z) - Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z) - Human Simulacra: Benchmarking the Personification of Large Language Models [38.21708264569801]
Large language models (LLMs) are recognized as systems that closely mimic aspects of human intelligence. This paper introduces a framework for constructing virtual characters' life stories from the ground up. Experimental results demonstrate that our constructed simulacra can produce personified responses that align with their target characters.
arXiv Detail & Related papers (2024-02-28T09:11:14Z) - You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception [37.667147915777534]
Human-centric perception is a long-standing problem for computer vision.
This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP).
The Human Query captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios.
arXiv Detail & Related papers (2023-12-09T10:36:43Z) - Hulk: A Universal Knowledge Translator for Human-Centric Tasks [69.8518392427151]
We present Hulk, the first multimodal human-centric generalist model. It addresses 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. Hulk achieves state-of-the-art performance on 11 benchmarks.
arXiv Detail & Related papers (2023-12-04T07:36:04Z) - Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z) - Human-centric Scene Understanding for 3D Large-scale Scenarios [52.12727427303162]
We present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife.
Our HuCenLife dataset can benefit many 3D perception tasks, such as segmentation, detection, and action recognition.
arXiv Detail & Related papers (2023-07-26T08:40:46Z) - UniHCP: A Unified Model for Human-Centric Perceptions [75.38263862084641]
We propose a Unified Model for Human-Centric Perceptions (UniHCP).
UniHCP unifies a wide range of human-centric tasks in a simplified end-to-end manner with a plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP outperforms strong baselines under direct evaluation; the shared-backbone, task-query pattern that such unified models describe is sketched after this list.
arXiv Detail & Related papers (2023-03-06T07:10:07Z)
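Several of the unified models above (UniHCP, HQNet, Hulk) describe a shared-backbone, task-specific-query design. Below is a minimal, assumption-laden PyTorch sketch of that general pattern; the module sizes, task names, and head shapes are illustrative and do not reproduce any of the cited architectures.

```python
# Sketch of the shared-backbone, per-task-query pattern used by unified
# human-centric perception models. All dimensions, task names, and head
# shapes are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedHCP(nn.Module):
    def __init__(self, tasks=("pose", "parsing", "attribute"), dim=256, num_queries=16):
        super().__init__()
        # Stand-in for a plain ViT encoder operating on patch tokens.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One learned query set per task; the backbone stays shared.
        self.queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(num_queries, dim)) for t in tasks}
        )
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Lightweight task heads mapping decoded queries to task outputs.
        self.heads = nn.ModuleDict({t: nn.Linear(dim, dim) for t in tasks})

    def forward(self, patch_tokens, task):
        # patch_tokens: (batch, num_patches, dim) image features.
        memory = self.encoder(patch_tokens)
        queries = self.queries[task].unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        decoded = self.decoder(queries, memory)
        return self.heads[task](decoded)

# Example usage: outputs = UnifiedHCP()(torch.randn(2, 196, 256), task="pose")
```

The design choice these papers share is that only the queries and heads are task-specific, so adding a task does not require retraining or duplicating the backbone.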
This list is automatically generated from the titles and abstracts of the papers on this site.