Social Norm Reasoning in Multimodal Language Models: An Evaluation
- URL: http://arxiv.org/abs/2603.03590v1
- Date: Tue, 03 Mar 2026 23:48:21 GMT
- Title: Social Norm Reasoning in Multimodal Language Models: An Evaluation
- Authors: Oishik Chowdhury, Anushka Debnath, Bastin Tony Roy Savarimuthu,
- Abstract summary: Multimodal Large Language Models (MLLMs) present promising possibilities for developing software used by robots to identify and reason about norms. This paper investigates the norm reasoning competence of five MLLMs by evaluating their ability to answer norm-related questions based on thirty text-based and thirty image-based stories. Our results show that MLLMs perform better at norm reasoning over text than over images.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Multi-Agent Systems (MAS), agents are designed with social capabilities, allowing them to understand and reason about social concepts such as norms when interacting with others (e.g., inter-robot interactions). In Normative MAS (NorMAS), researchers study how norms develop and how violations are detected and sanctioned. However, existing research in NorMAS uses symbolic approaches (e.g., formal logic) for norm representation and reasoning, whose application is limited to simplified environments. In contrast, Multimodal Large Language Models (MLLMs) present promising possibilities for developing software used by robots to identify and reason about norms in a wide variety of complex social situations embodied in text and images. However, prior work on norm reasoning has been limited to text-based scenarios. This paper investigates the norm reasoning competence of five MLLMs by evaluating their ability to answer norm-related questions based on thirty text-based and thirty image-based stories, and comparing their responses against humans. Our results show that MLLMs perform better at norm reasoning over text than over images. GPT-4o performs best in both modalities, offering the most promise for integration with MAS, followed by the free model Qwen-2.5VL. Additionally, all models find reasoning about complex norms challenging.
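A minimal sketch of the kind of comparison the abstract describes: scoring a model's answers to norm-related questions against human responses, split by modality. The data, question identifiers, and answer labels below are hypothetical; this is not the authors' actual evaluation protocol.

```python
# Sketch: per-modality agreement between model answers and the human
# majority answer for each norm-related question (hypothetical data).
from collections import Counter

def human_majority(answers):
    """Return the most common human answer for one question."""
    return Counter(answers).most_common(1)[0][0]

def modality_accuracy(model_answers, human_answers):
    """Fraction of questions where the model matches the human majority."""
    correct = sum(
        1 for q, a in model_answers.items()
        if a == human_majority(human_answers[q])
    )
    return correct / len(model_answers)

# Toy example: two text-based and two image-based story questions.
human = {
    "text_story_1": ["violation", "violation", "no_violation"],
    "text_story_2": ["no_violation", "no_violation", "no_violation"],
    "image_story_1": ["violation", "violation", "violation"],
    "image_story_2": ["violation", "no_violation", "no_violation"],
}
model = {
    "text_story_1": "violation",
    "text_story_2": "no_violation",
    "image_story_1": "no_violation",
    "image_story_2": "no_violation",
}

text_acc = modality_accuracy(
    {q: a for q, a in model.items() if q.startswith("text")}, human)
image_acc = modality_accuracy(
    {q: a for q, a in model.items() if q.startswith("image")}, human)
print(text_acc, image_acc)  # 1.0 0.5
```

With this toy data the model agrees with the human majority on both text stories but only one image story, mirroring the text-over-image gap reported above.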
Related papers
- Where Norms and References Collide: Evaluating LLMs on Normative Reasoning [3.8431932182760296]
Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms. It remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. We introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR.
arXiv Detail & Related papers (2026-02-03T01:23:22Z) - Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives [5.120890045747202]
We evaluate large language models' reasoning capabilities in the normative domain from both logical and modal perspectives. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning.
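As an illustration of the reasoning patterns such benchmarks probe, a textbook deontic-logic example (not necessarily an item from the paper):

```latex
% Standard deontic logic: O = "it is obligatory that",
% P = "it is permitted that" (illustrative, not from the paper).
% Inheritance rule: if p -> q is a theorem, obligations transfer:
\[
  \frac{\vdash p \rightarrow q}{\vdash O\,p \rightarrow O\,q}
\]
% Axiom D: obligation implies permission.
\[
  O\,p \rightarrow P\,p
\]
% The converse, P p -> O p, is invalid; checking whether a model
% accepts the first pattern while rejecting the second is the kind
% of consistency test "valid reasoning patterns" refers to above.
```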
arXiv Detail & Related papers (2025-10-30T15:35:13Z) - MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI [59.196131618912005]
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs). Existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities. We introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability.
arXiv Detail & Related papers (2025-06-30T07:14:38Z) - On Path to Multimodal Generalist: General-Level and General-Bench [153.9720740167528]
This project introduces General-Level, an evaluation framework that defines a five-level scale of MLLM performance and generality. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation. Evaluation results covering over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists.
arXiv Detail & Related papers (2025-05-07T17:59:32Z) - VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These questions can be used to assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - EgoNormia: Benchmarking Physical Social Norm Understanding [52.87904722234434]
EGONORMIA spans seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. Our work demonstrates that current state-of-the-art vision-language models (VLMs) lack robust grounded norm understanding, scoring a maximum of 54% on EGONORMIA and 65% on EGONORMIA-verified.
arXiv Detail & Related papers (2025-02-27T19:54:16Z) - Social Genome: Grounded Social Reasoning Abilities of Multimodal Models [61.88413918026431]
Social reasoning abilities are crucial for AI systems to interpret and respond to multimodal human communication and interaction within social contexts. We introduce SOCIAL GENOME, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models.
arXiv Detail & Related papers (2025-02-21T00:05:40Z) - Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z) - Normative Requirements Operationalization with Large Language Models [3.456725053685842]
Normative non-functional requirements specify constraints that a system must observe in order to avoid violations of social, legal, ethical, empathetic, and cultural norms.
Recent research has tackled this challenge using a domain-specific language to specify normative requirements.
We propose a complementary approach that uses Large Language Models to extract semantic relationships between abstract representations of system capabilities.
arXiv Detail & Related papers (2024-04-18T17:01:34Z) - Harnessing the power of LLMs for normative reasoning in MASs [3.1796285054362605]
Large Language Models (LLMs) offer a rich and expressive vocabulary for norms.
LLMs can perform a range of tasks such as norm discovery, normative reasoning and decision-making.
This paper aims to foster collaboration between MAS, NLP and LLM researchers in order to advance the field of normative agents.
arXiv Detail & Related papers (2024-03-25T08:09:01Z) - Emergence of Social Norms in Generative Agent Societies: Principles and Architecture [8.094425852451643]
We propose a novel architecture, named CRSEC, to empower the emergence of social norms within generative MASs.
Our architecture consists of four modules: Creation & Representation, Spreading, Evaluation, and Compliance.
Our experiments demonstrate the capability of our architecture to establish social norms and reduce social conflicts within generative MASs.
arXiv Detail & Related papers (2024-03-13T05:08:10Z) - InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed, open-ended, multi-step reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.