HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
- URL: http://arxiv.org/abs/2505.11454v5
- Date: Sun, 09 Nov 2025 23:48:51 GMT
- Title: HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
- Authors: Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya
- Abstract summary: Large multimodal models (LMMs) have achieved impressive performance on vision tasks such as visual question answering (VQA), image captioning, and visual grounding. HumaniBench is a benchmark comprising 32,000 real-world image-question pairs and an accompanying evaluation suite. It assesses LMMs across seven key alignment principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality.
- Score: 44.973773675725674
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large multimodal models (LMMs) have achieved impressive performance on vision-language tasks such as visual question answering (VQA), image captioning, and visual grounding; however, they remain insufficiently evaluated for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a comprehensive benchmark comprising 32,000 real-world image-question pairs and an accompanying evaluation suite. Using a semi-automated annotation pipeline, each sample is rigorously validated by domain experts to ensure accuracy and ethical integrity. HumaniBench assesses LMMs across seven key alignment principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality through a diverse set of open- and closed-ended VQA tasks. Grounded in AI ethics theory and real-world social contexts, these principles provide a holistic lens for examining human-aligned behavior. Benchmarking results reveal distinct behavioral patterns: certain model families excel in reasoning, fairness, and multilinguality, while others demonstrate greater robustness and grounding capability. However, most models still struggle to balance task accuracy with ethical and inclusive responses. Techniques such as chain-of-thought prompting and test-time scaling yield measurable alignment gains. As the first benchmark explicitly designed for HC evaluation, HumaniBench offers a rigorous testbed to diagnose limitations, quantify alignment trade-offs, and promote the responsible development of large multimodal models. All data and code are publicly released to ensure transparency and reproducibility. https://vectorinstitute.github.io/HumaniBench/
Related papers
- HumanLLM: Towards Personalized Understanding and Simulation of Human Nature [72.55730315685837]
HumanLLM is a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences.
arXiv Detail & Related papers (2026-01-22T09:27:27Z) - Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models [118.44328586173556]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. Human-MME is a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Our benchmark extends single-target understanding to multi-person and multi-image mutual understanding.
arXiv Detail & Related papers (2025-09-30T12:20:57Z) - Measuring AI Alignment with Human Flourishing [0.0]
This paper introduces the Flourishing AI Benchmark (FAI Benchmark), a novel evaluation framework that assesses AI alignment with human flourishing. The Benchmark measures AI performance on how effectively models contribute to the flourishing of a person across seven dimensions. This research establishes a framework for developing AI systems that actively support human flourishing rather than merely avoiding harm.
arXiv Detail & Related papers (2025-07-10T14:09:53Z) - HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z) - SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions [36.010107260144586]
The SoMi-ToM benchmark is designed to evaluate multi-perspective theory of mind (ToM) in embodied multi-agent complex social interactions. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1,225 expert-annotated multiple-choice questions. Results show that LVLMs perform significantly worse than humans on SoMi-ToM.
arXiv Detail & Related papers (2025-06-29T00:54:13Z) - Perceptual Quality Assessment for Embodied AI [66.96928199019129]
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories. There is no IQA method to assess the usability of an image in embodied tasks, i.e., its perceptual quality for robots.
arXiv Detail & Related papers (2025-05-22T15:51:07Z) - Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans [9.315735862658244]
We propose Human-Aligned Bench, a benchmark for aligning multimodal reasoning with human performance. We collected 9,794 multimodal questions that rely solely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions. Extensive experiments on the Human-Aligned Bench reveal notable differences between the multimodal reasoning performance of current MLLMs and human performance.
arXiv Detail & Related papers (2025-05-16T11:41:19Z) - Empirically evaluating commonsense intelligence in large language models with large-scale human judgments [4.7206754497888035]
We propose a novel method for evaluating common sense in artificial intelligence. We measure the correspondence between a model's judgment and that of a human population. Our framework contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
arXiv Detail & Related papers (2025-05-15T13:55:27Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - Turing Representational Similarity Analysis (RSA): A Flexible Method for Measuring Alignment Between Human and Artificial Intelligence [0.62914438169038]
We developed Turing Representational Similarity Analysis (RSA), a method that uses pairwise similarity ratings to quantify alignment between AIs and humans. We tested this approach on semantic alignment across text and image modalities, measuring how the similarity judgments of different Large Language and Vision Language Models (LLMs and VLMs) aligned with human responses at both the group and individual levels.
arXiv Detail & Related papers (2024-11-30T20:24:52Z) - HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks [25.959032350818795]
We present HumanEval-V, a benchmark of human-annotated coding tasks. Each task features carefully crafted diagrams paired with function signatures and test cases. We find that even top-performing models achieve only modest success rates.
arXiv Detail & Related papers (2024-10-16T09:04:57Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - A-Bench: Are LMMs Masters at Evaluating AI-generated Images? [78.3699767628502]
A-Bench is a benchmark designed to diagnose whether large multimodal models (LMMs) are masters at evaluating AI-generated images (AIGIs). Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts, and tested across 18 leading LMMs.
arXiv Detail & Related papers (2024-06-05T08:55:02Z) - Quality Assessment for AI Generated Images with Instruction Tuning [58.41087653543607]
We first establish a novel Image Quality Assessment (IQA) database for AIGIs, termed AIGCIQA2023+. This paper presents a MINT-IQA model to evaluate and explain human preferences for AIGIs from Multi-perspectives with INstruction Tuning.
arXiv Detail & Related papers (2024-05-12T17:45:11Z) - Assessment of Multimodal Large Language Models in Alignment with Human Values [43.023052912326314]
We introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations.
The Ch3Ef dataset contains 1,002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh (helpful, honest, harmless) principle.
arXiv Detail & Related papers (2024-03-26T16:10:21Z) - Hulk: A Universal Knowledge Translator for Human-Centric Tasks [69.8518392427151]
We present Hulk, the first multimodal human-centric generalist model.
It addresses 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning.
Hulk achieves state-of-the-art performance in 11 benchmarks.
arXiv Detail & Related papers (2023-12-04T07:36:04Z) - Who's Thinking? A Push for Human-Centered Evaluation of LLMs using the XAI Playbook [30.985555463848264]
We draw parallels between the relatively mature field of XAI and the rapidly evolving research boom around large language models.
We argue that humans' tendencies should rest front and center when evaluating deployed large language models.
arXiv Detail & Related papers (2023-03-10T22:15:49Z) - HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining [75.1086193340286]
It is desirable to have a general pretrained model for versatile human-centric downstream tasks.
We propose HumanBench, built on existing datasets, to evaluate on common ground the generalization abilities of different pretraining methods.
Our PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets.
arXiv Detail & Related papers (2023-03-10T02:57:07Z) - Aligning AI With Shared Human Values [85.2824609130584]
We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality.
We find that current language models have a promising but incomplete ability to predict basic human ethical judgements.
Our work shows that progress can be made on machine ethics today, and it provides a stepping stone toward AI that is aligned with human values.
arXiv Detail & Related papers (2020-08-05T17:59:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.