HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
- URL: http://arxiv.org/abs/2505.11454v3
- Date: Fri, 01 Aug 2025 02:38:04 GMT
- Title: HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
- Authors: Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya
- Abstract summary: Large multimodal models (LMMs) have been widely tested on tasks like visual question answering (VQA), image captioning, and grounding. HumaniBench is a novel benchmark of 32,000 real-world image-question pairs and an evaluation suite. HumaniBench assesses LMMs across seven key alignment principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality.
- Score: 38.614841553065766
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large multimodal models (LMMs) have been widely tested on tasks like visual question answering (VQA), image captioning, and grounding, but lack rigorous evaluation for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a novel benchmark of 32,000 real-world image-question pairs and an evaluation suite. Labels are generated via an AI-assisted pipeline and validated by experts. HumaniBench assesses LMMs across seven key alignment principles (fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality) through diverse open-ended and closed-ended VQA tasks. Grounded in AI ethics and real-world needs, these principles provide a holistic lens for societal impact. Benchmarking results across different LMMs show that proprietary models generally lead in reasoning, fairness, and multilinguality, while open-source models excel in robustness and grounding. Most models struggle to balance accuracy with ethical and inclusive behavior. Techniques like Chain-of-Thought prompting and test-time scaling improve alignment. As the first benchmark tailored for HC alignment, HumaniBench offers a rigorous testbed to diagnose limitations and promote responsible LMM development. All data and code are publicly available for reproducibility. Keywords: HumaniBench, vision-language models, responsible AI benchmark, AI alignment evaluation, AI ethics assessment, fairness in AI models, visual question answering (VQA) benchmark, image captioning evaluation, visual grounding tasks, trustworthy AI models, Chain-of-Thought prompting, test-time scaling, ethical AI development tools.
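As a rough illustration of the setup described above (principle-tagged image-question pairs, open- and closed-ended VQA, and Chain-of-Thought prompting as one of the alignment-improving techniques), here is a minimal sketch of how such an evaluation loop might be organized. The item fields, prompt wording, and the model/judge callables are assumptions made for illustration, not the released HumaniBench code.

```python
# Hypothetical sketch of a principle-tagged VQA evaluation loop with a
# simple Chain-of-Thought prompt. Field names and helpers are illustrative
# assumptions, not the actual HumaniBench API.
from dataclasses import dataclass
from typing import Callable

PRINCIPLES = ["fairness", "ethics", "empathy", "inclusivity",
              "reasoning", "robustness", "multilinguality"]

@dataclass
class VQAItem:
    image_path: str
    question: str
    principle: str          # one of PRINCIPLES
    reference_answer: str   # expert-validated label

def build_cot_prompt(item: VQAItem) -> str:
    """Wrap the question in a step-by-step (Chain-of-Thought) instruction."""
    return (f"Question about the image: {item.question}\n"
            f"Think step by step, paying attention to {item.principle}, "
            f"then give a short final answer.")

def evaluate(model: Callable[..., str],
             judge: Callable[[str, str], float],
             items: list[VQAItem]) -> dict[str, float]:
    """Average judge scores (0.0-1.0) per alignment principle."""
    scores: dict[str, list[float]] = {p: [] for p in PRINCIPLES}
    for item in items:
        answer = model(image=item.image_path, prompt=build_cot_prompt(item))
        scores[item.principle].append(judge(answer, item.reference_answer))
    return {p: sum(v) / len(v) for p, v in scores.items() if v}
```

Closed-ended items could be scored by exact match and open-ended ones by a rubric-based judge; both are abstracted behind the judge callable here.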
Related papers
- Measuring AI Alignment with Human Flourishing [0.0]
This paper introduces the Flourishing AI Benchmark (FAI Benchmark), a novel evaluation framework that assesses AI alignment with human flourishing. The Benchmark measures AI performance on how effectively models contribute to the flourishing of a person across seven dimensions. This research establishes a framework for developing AI systems that actively support human flourishing rather than merely avoiding harm.
arXiv Detail & Related papers (2025-07-10T14:09:53Z)
- Perceptual Quality Assessment for Embodied AI [66.96928199019129]
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories. There is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots.
arXiv Detail & Related papers (2025-05-22T15:51:07Z)
- Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans [9.315735862658244]
We propose Human-Aligned Bench, a benchmark for aligning multimodal reasoning with human performance. We collected 9,794 multimodal questions that rely solely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions. Extensive experiments on the Human-Aligned Bench reveal notable gaps between the multimodal reasoning performance of current MLLMs and that of humans.
arXiv Detail & Related papers (2025-05-16T11:41:19Z)
- Empirically evaluating commonsense intelligence in large language models with large-scale human judgments [4.7206754497888035]
We propose a novel method for evaluating common sense in artificial intelligence. We measure the correspondence between a model's judgment and that of a human population. Our framework contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
arXiv Detail & Related papers (2025-05-15T13:55:27Z)
- Turing Representational Similarity Analysis (RSA): A Flexible Method for Measuring Alignment Between Human and Artificial Intelligence [0.62914438169038]
We developed Turing Representational Similarity Analysis (RSA), a method that uses pairwise similarity ratings to quantify alignment between AIs and humans. We tested this approach on semantic alignment across text and image modalities, measuring how similarity judgments from different Large Language and Vision Language Models (LLMs and VLMs) aligned with human responses at both the group and individual level.
arXiv Detail & Related papers (2024-11-30T20:24:52Z)
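The pairwise-similarity comparison described above can be sketched directly: collect an n x n matrix of similarity ratings from humans and another from the model over the same item set, then correlate their upper triangles. The code below is an assumption-based reconstruction of that RSA-style comparison, not the authors' implementation.

```python
# Illustrative RSA-style comparison: correlate a model's pairwise similarity
# ratings with human ratings over the same items. Assumption-based sketch,
# not the paper's code.
import numpy as np
from scipy.stats import spearmanr

def rsa_alignment(human_sim: np.ndarray, model_sim: np.ndarray) -> float:
    """Spearman correlation between the upper triangles of two
    n x n pairwise-similarity matrices (diagonal excluded)."""
    assert human_sim.shape == model_sim.shape
    iu = np.triu_indices_from(human_sim, k=1)
    rho, _ = spearmanr(human_sim[iu], model_sim[iu])
    return float(rho)

# Toy usage: 4 items rated for pairwise similarity on a 0-1 scale.
human = np.array([[1.0, 0.8, 0.2, 0.1],
                  [0.8, 1.0, 0.3, 0.2],
                  [0.2, 0.3, 1.0, 0.7],
                  [0.1, 0.2, 0.7, 1.0]])
model = np.array([[1.0, 0.7, 0.3, 0.2],
                  [0.7, 1.0, 0.2, 0.3],
                  [0.3, 0.2, 1.0, 0.6],
                  [0.2, 0.3, 0.6, 1.0]])
print(rsa_alignment(human, model))  # closer to 1.0 means closer alignment
```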
- HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks [25.959032350818795]
We present HumanEval-V, a benchmark of human-annotated coding tasks. Each task features carefully crafted diagrams paired with function signatures and test cases. We find that even top-performing models achieve modest success rates.
arXiv Detail & Related papers (2024-10-16T09:04:57Z)
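To make the task format concrete, the snippet below imitates what a diagram-grounded coding task of this kind might look like: a function signature whose docstring refers to an accompanying diagram, plus test cases a harness would run against the model-generated body. The function, docstring, and tests are invented for illustration and are not drawn from the HumanEval-V release.

```python
# Hypothetical task in the HumanEval-V style: the model sees a diagram
# (e.g., a grid with shaded cells) plus this signature and must produce
# the body; the harness then runs the test cases. Invented for illustration.

def count_shaded_cells(grid: list[list[int]]) -> int:
    """Return how many cells are marked 1 in the grid shown in the diagram."""
    return sum(cell for row in grid for cell in row if cell == 1)

# Test cases of the kind a benchmark harness would execute.
assert count_shaded_cells([[0, 1], [1, 1]]) == 3
assert count_shaded_cells([[0, 0], [0, 0]]) == 0
```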
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- A-Bench: Are LMMs Masters at Evaluating AI-generated Images? [78.3699767628502]
A-Bench is a benchmark designed to diagnose whether multi-modal models (LMMs) are masters at evaluating AI-generated images (AIGIs). Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts, and tested across 18 leading LMMs.
arXiv Detail & Related papers (2024-06-05T08:55:02Z)
- Quality Assessment for AI Generated Images with Instruction Tuning [58.41087653543607]
We first establish a novel Image Quality Assessment (IQA) database for AIGIs, termed AIGCIQA2023+. This paper presents a MINT-IQA model to evaluate and explain human preferences for AIGIs from Multi-perspectives with INstruction Tuning.
arXiv Detail & Related papers (2024-05-12T17:45:11Z)
- Hulk: A Universal Knowledge Translator for Human-Centric Tasks [69.8518392427151]
We present Hulk, the first multimodal human-centric generalist model.
It addresses 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning.
Hulk achieves state-of-the-art performance on 11 benchmarks.
arXiv Detail & Related papers (2023-12-04T07:36:04Z)
- Who's Thinking? A Push for Human-Centered Evaluation of LLMs using the XAI Playbook [30.985555463848264]
We draw parallels between the relatively mature field of XAI and the rapidly evolving research boom around large language models.
We argue that humans' tendencies should rest front and center when evaluating deployed large language models.
arXiv Detail & Related papers (2023-03-10T22:15:49Z)
- HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining [75.1086193340286]
It is desirable to have a general pretrain model for versatile human-centric downstream tasks.
We propose HumanBench, built on existing datasets, to evaluate the generalization abilities of different pretraining methods on a common ground.
Our PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets.
arXiv Detail & Related papers (2023-03-10T02:57:07Z)
- Aligning AI With Shared Human Values [85.2824609130584]
We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality.
We find that current language models have a promising but incomplete ability to predict basic human ethical judgements.
Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
arXiv Detail & Related papers (2020-08-05T17:59:16Z)