M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs
- URL: http://arxiv.org/abs/2412.20718v1
- Date: Mon, 30 Dec 2024 05:18:55 GMT
- Title: M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs
- Authors: Bei Yan, Jie Zhang, Zhiyuan Chen, Shiguang Shan, Xilin Chen
- Abstract summary: We introduce M$^3$oralBench, the first MultiModal Moral Benchmark for LVLMs. M$^3$oralBench expands the everyday moral scenarios in Moral Foundations Vignettes (MFVs) and employs the text-to-image diffusion model, SD3.0, to create corresponding scenario images. It conducts moral evaluation across six moral foundations of Moral Foundations Theory (MFT) and encompasses tasks in moral judgement, moral classification, and moral response.
- Score: 66.78407469042642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large foundation models, including large language models (LLMs) and large vision-language models (LVLMs), have become essential tools in critical fields such as law, finance, and healthcare. As these models increasingly integrate into our daily life, it is necessary to conduct moral evaluation to ensure that their outputs align with human values and remain within moral boundaries. Previous works primarily focus on LLMs, proposing moral datasets and benchmarks limited to text modality. However, given the rapid development of LVLMs, there is still a lack of multimodal moral evaluation methods. To bridge this gap, we introduce M$^3$oralBench, the first MultiModal Moral Benchmark for LVLMs. M$^3$oralBench expands the everyday moral scenarios in Moral Foundations Vignettes (MFVs) and employs the text-to-image diffusion model, SD3.0, to create corresponding scenario images. It conducts moral evaluation across six moral foundations of Moral Foundations Theory (MFT) and encompasses tasks in moral judgement, moral classification, and moral response, providing a comprehensive assessment of model performance in multimodal moral understanding and reasoning. Extensive experiments on 10 popular open-source and closed-source LVLMs demonstrate that M$^3$oralBench is a challenging benchmark, exposing notable moral limitations in current models. Our benchmark is publicly available.
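The abstract outlines a two-stage pipeline: expand each Moral Foundations Vignette into a scenario image with SD3.0, then score an LVLM on three multiple-choice-style tasks (moral judgement, moral classification, moral response) across the six MFT foundations. The Python sketch below illustrates only that general shape; the SD3 checkpoint ID, the scenario file name, and the answer-matching rule are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of an M^3oralBench-style pipeline; assumptions flagged inline.
import json
import os

import torch
from diffusers import StableDiffusion3Pipeline

# Stage 1: render each Moral Foundations Vignette as a scenario image.
# The checkpoint ID below is the public SD3 medium release, assumed here
# because the paper only says "SD3.0". "mfv_scenarios.json" is hypothetical.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to("cuda")

os.makedirs("images", exist_ok=True)
scenarios = json.load(open("mfv_scenarios.json"))
for s in scenarios:
    image = pipe(prompt=s["text"], num_inference_steps=28).images[0]
    image.save(f"images/{s['id']}.png")

# Stage 2: score an LVLM on a task as multiple-choice accuracy.
def accuracy(records, ask_lvlm):
    """records: dicts with image path, question, options, gold option letter.
    ask_lvlm: caller-supplied function querying the model under evaluation."""
    correct = sum(
        ask_lvlm(r["image"], r["question"], r["options"])
        .strip().upper().startswith(r["gold"])
        for r in records
    )
    return correct / len(records)
```

Per-model results would then be aggregated separately over the six foundations and the three task types to yield the breakdown the abstract describes.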
Related papers
- Are Language Models Sensitive to Morally Irrelevant Distractors? [47.92026843851412]
We show that moral distractors can shift the moral judgements of large language models by over 30% even in low-ambiguity scenarios. This research challenges theories that assume the stability of human moral judgements.
arXiv Detail & Related papers (2026-02-10T05:18:05Z) - MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment [48.39756797294967]
We present MM-SCALE, a dataset for aligning Vision-Language Models with human moral preferences. Each image-scenario pair is annotated with moral acceptability scores and grounded reasoning labels by humans. Our framework provides richer alignment signals and finer calibration of multimodal moral reasoning.
arXiv Detail & Related papers (2026-02-03T15:48:00Z) - Do VLMs Have a Moral Backbone? A Study on the Fragile Morality of Vision-Language Models [41.633874062439254]
It remains unclear whether Vision-Language Models (VLMs) are stable in realistic settings. We probe VLMs with a diverse set of model-agnostic multimodal perturbations and find that their moral stances are highly fragile. We show that lightweight inference-time interventions can partially restore moral stability.
arXiv Detail & Related papers (2026-01-23T06:00:09Z) - MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables [50.29407048003165]
We present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning.
arXiv Detail & Related papers (2025-09-15T19:06:10Z) - Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants [0.36326779753373206]
The recent rise in popularity of large language models (LLMs) has prompted considerable concerns about their moral capabilities. This paper examines their capacity to function as Artificial Moral Assistants (AMAs). We argue that qualifying as an AMA requires more than what state-of-the-art alignment techniques aim to achieve.
arXiv Detail & Related papers (2025-08-18T09:28:55Z) - Discerning What Matters: A Multi-Dimensional Assessment of Moral Competence in LLMs [0.0]
Moral competence is the ability to act in accordance with moral principles. As large language models (LLMs) are increasingly deployed in situations demanding moral competence, there is growing interest in evaluating this ability empirically. We identify three significant shortcomings: (i) over-reliance on prepackaged moral scenarios with explicitly highlighted moral features; (ii) a focus on verdict prediction rather than moral reasoning; and (iii) inadequate testing of models' (in)ability to recognize when additional information is needed.
arXiv Detail & Related papers (2025-06-16T03:59:38Z) - Are Language Models Consequentialist or Deontological Moral Reasoners? [69.85385952436044]
We focus on a large-scale analysis of the moral reasoning traces provided by large language models (LLMs). We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology.
arXiv Detail & Related papers (2025-05-27T17:51:18Z) - When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas [68.79830818369683]
Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents. There is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. We introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts.
arXiv Detail & Related papers (2025-05-25T16:19:24Z) - MORALISE: A Structured Benchmark for Moral Alignment in Visual Language Models [38.0475868976819]
Vision-language models have demonstrated increasing influence in morally sensitive domains such as autonomous driving and medical analysis. We introduce MORALISE, a benchmark for evaluating the moral alignment of vision-language models using diverse, expert-verified real-world data.
arXiv Detail & Related papers (2025-05-20T01:11:17Z) - From Stability to Inconsistency: A Study of Moral Preferences in LLMs [4.12484724941528]
We introduce a Moral Foundations LLM dataset (MFD-LLM) grounded in Moral Foundations Theory.
We propose a novel evaluation method that captures the full spectrum of LLMs' revealed moral preferences by answering a range of real-world moral dilemmas.
Our findings reveal that state-of-the-art models have remarkably homogeneous value preferences, yet demonstrate a lack of consistency.
arXiv Detail & Related papers (2025-04-08T11:52:50Z) - The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas [0.3386560551295745]
We evaluate the moral judgments of LLMs using utilitarian dilemmas.
Our analysis reveals consistently encoded moral preferences that diverge from established moral theories and lay population moral standards.
arXiv Detail & Related papers (2025-03-25T12:29:53Z) - Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [82.92771279118888]
We introduce Multimodal RewardBench, an expert-annotated benchmark for evaluating multimodal reward models.
Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various vision-language models.
We find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy.
arXiv Detail & Related papers (2025-02-20T01:48:13Z) - LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models [71.8065384742686]
LMMS-EVAL is a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models.
LMMS-EVAL LITE is a pruned evaluation toolkit that emphasizes both coverage and efficiency.
Multimodal LIVEBENCH utilizes continuously updating news and online forums to assess models' generalization abilities in the wild.
arXiv Detail & Related papers (2024-07-17T17:51:53Z) - MoralBench: Moral Evaluation of LLMs [34.43699121838648]
This paper introduces a novel benchmark designed to measure and compare the moral reasoning capabilities of large language models (LLMs).
We present the first comprehensive dataset specifically curated to probe the moral dimensions of LLM outputs.
Our methodology involves a multi-faceted approach, combining quantitative analysis with qualitative insights from ethics scholars to ensure a thorough evaluation of model performance.
arXiv Detail & Related papers (2024-06-06T18:15:01Z) - Exploring and steering the moral compass of Large Language Models [55.2480439325792]
Large Language Models (LLMs) have become central to advancing automation and decision-making across various sectors.
This study proposes a comprehensive comparative analysis of the most advanced LLMs to assess their moral profiles.
arXiv Detail & Related papers (2024-05-27T16:49:22Z) - Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning [36.66806788879868]
Large Language Models (LLMs) have made unprecedented breakthroughs, yet their integration into everyday life might raise societal risks due to generated unethical content.
This work delves into ethical values utilizing Moral Foundations Theory.
arXiv Detail & Related papers (2023-10-17T07:42:40Z) - Rethinking Machine Ethics -- Can LLMs Perform Moral Reasoning through the Lens of Moral Theories? [78.3738172874685]
Making moral judgments is an essential step toward developing ethical AI systems.
Prevalent approaches are mostly implemented in a bottom-up manner, which uses a large set of annotated data to train models based on crowd-sourced opinions about morality.
This work proposes a flexible top-down framework to steer (Large) Language Models (LMs) to perform moral reasoning with well-established moral theories from interdisciplinary research.
arXiv Detail & Related papers (2023-08-29T15:57:32Z) - LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM Evaluation Hub (LVLM-eHub).
Our LVLM-eHub consists of $8$ representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLMs trained with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)