Where Norms and References Collide: Evaluating LLMs on Normative Reasoning
- URL: http://arxiv.org/abs/2602.02975v1
- Date: Tue, 03 Feb 2026 01:23:22 GMT
- Title: Where Norms and References Collide: Evaluating LLMs on Normative Reasoning
- Authors: Mitchell Abrams, Kaveh Eskandari Miandoab, Felix Gervits, Vasanth Sarathy, Matthias Scheutz
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms: shared expectations that constrain what actions are appropriate in context. A key capability in such settings is norm-based reference resolution (NBRR), where interpreting referential expressions requires inferring implicit normative expectations grounded in physical and social context. Yet it remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. In this work, we introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR. SNIC emphasizes physically grounded norms that arise in everyday tasks such as cleaning, tidying, and serving. Across a range of controlled evaluations, we find that even the strongest LLMs struggle to consistently identify and apply social norms, particularly when norms are implicit, underspecified, or in conflict. These findings reveal a blind spot in current LLMs and highlight a key challenge for deploying language-based systems in socially situated, embodied settings.
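The abstract describes diagnostic items where a referential expression can only be resolved by inferring an implicit norm from the scene. The paper does not publish its item schema here, so the sketch below is a hypothetical illustration of what one NBRR probe and an exact-match scorer might look like; the scenario, field names, and scoring rule are all assumptions, not SNIC's actual format.

```python
# Hypothetical sketch of an NBRR diagnostic item and scorer.
# The schema and example scenario are illustrative assumptions,
# not the SNIC dataset's actual format.
from dataclasses import dataclass


@dataclass
class NBRRItem:
    scene: str              # physical/social context the model must reason over
    utterance: str          # referential expression to resolve
    candidates: list[str]   # possible referents present in the scene
    norm: str               # implicit norm that disambiguates the reference
    answer: str             # referent licensed by the norm

    def as_prompt(self) -> str:
        """Render the item as a plain-text prompt for an LLM."""
        options = "\n".join(f"- {c}" for c in self.candidates)
        return (
            f"Scene: {self.scene}\n"
            f'Instruction: "{self.utterance}"\n'
            f"Candidate referents:\n{options}\n"
            "Which object does the instruction refer to? Answer with one option."
        )


def score(item: NBRRItem, model_output: str) -> bool:
    """Substring match against the norm-licensed referent."""
    return item.answer.lower() in model_output.lower()


item = NBRRItem(
    scene="A dinner table after a meal; one plate is dirty, one plate is clean.",
    utterance="Please clear the plate.",
    candidates=["the dirty plate", "the clean plate"],
    norm="Dirty dishes, not clean ones, are cleared after a meal.",
    answer="the dirty plate",
)

print(item.as_prompt())
print(score(item, "The instruction refers to the dirty plate."))  # True
```

Note that resolving the reference correctly requires applying the unstated norm, not just parsing the utterance: both candidates match "the plate" literally.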
Related papers
- Social Norm Reasoning in Multimodal Language Models: An Evaluation [0.8181983928344693]
Multimodal Large Language Models (MLLMs) present promising possibilities for developing software that robots can use to identify and reason about norms. This paper investigates the norm reasoning competence of five MLLMs by evaluating their ability to answer norm-related questions based on thirty text-based and thirty image-based stories. Our results show that MLLMs reason about norms more accurately in text than in images.
arXiv Detail & Related papers (2026-03-03T23:48:21Z)
- How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities [75.10343190811592]
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains. Our benchmark offers a principled and interpretable framework for safe and controllable behavior.
arXiv Detail & Related papers (2026-03-03T03:50:13Z)
- Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives [5.120890045747202]
We evaluate large language models' reasoning capabilities in the normative domain from both logical and modal perspectives. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning.
arXiv Detail & Related papers (2025-10-30T15:35:13Z)
- Tool for Supporting Debugging and Understanding of Normative Requirements Using LLMs [3.7885668021375465]
Normative requirements specify social, legal, ethical, empathetic, and cultural (SLEEC) norms that must be observed by a system. These requirements are typically defined by non-technical stakeholders with diverse expertise. SLEEC-LLM improves the efficiency and explainability of normative requirements elicitation and consistency analysis.
arXiv Detail & Related papers (2025-07-07T21:57:28Z)
- Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning [53.92712851223158]
We formulate safety and privacy issues as contextualized compliance problems following Contextual Integrity (CI) theory. Under the CI framework, we align our model with critical regulatory standards, including the EU AI Act and HIPAA. We employ reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning capabilities while enhancing compliance with safety and privacy norms.
arXiv Detail & Related papers (2025-05-20T16:40:09Z)
- EgoNormia: Benchmarking Physical Social Norm Understanding [52.87904722234434]
EGONORMIA spans seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. Our work demonstrates that current state-of-the-art vision-language models (VLMs) lack robust grounded norm understanding, scoring a maximum of 54% on EGONORMIA and 65% on EGONORMIA-verified.
arXiv Detail & Related papers (2025-02-27T19:54:16Z)
- RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios [58.90106984375913]
RuleArena is a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions.
arXiv Detail & Related papers (2024-12-12T06:08:46Z)
- Normative Requirements Operationalization with Large Language Models [3.456725053685842]
Normative non-functional requirements specify constraints that a system must observe in order to avoid violations of social, legal, ethical, empathetic, and cultural norms.
Recent research has tackled this challenge using a domain-specific language to specify normative requirements.
We propose a complementary approach that uses Large Language Models to extract semantic relationships between abstract representations of system capabilities.
arXiv Detail & Related papers (2024-04-18T17:01:34Z)
- Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement [92.61557711360652]
Language models (LMs) often fall short on inductive reasoning, despite achieving impressive success on research benchmarks.
We conduct a systematic study of the inductive reasoning capabilities of LMs through iterative hypothesis refinement.
We reveal several discrepancies between the inductive reasoning processes of LMs and humans, shedding light on both the potentials and limitations of using LMs in inductive reasoning tasks.
arXiv Detail & Related papers (2023-10-12T17:51:10Z)
- CPL-NoViD: Context-Aware Prompt-based Learning for Norm Violation Detection in Online Communities [28.576099654579437]
We introduce Context-aware Prompt-based Learning for Norm Violation Detection (CPL-NoViD).
CPL-NoViD outperforms the baseline by incorporating context through natural language prompts.
It establishes a new state-of-the-art in norm violation detection, surpassing existing benchmarks.
arXiv Detail & Related papers (2023-05-16T23:27:59Z)
- NormSAGE: Multi-Lingual Multi-Cultural Norm Discovery from Conversations On-the-Fly [61.77957329364812]
We introduce a framework for addressing the novel task of conversation-grounded multi-lingual, multi-cultural norm discovery.
NormSAGE elicits knowledge about norms through directed questions representing the norm discovery task and conversation context.
It further addresses the risk of language model hallucination with a self-verification mechanism ensuring that the norms discovered are correct.
arXiv Detail & Related papers (2022-10-16T18:30:05Z)
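Several of the listed approaches (e.g. CPL-NoViD) fold the governing norm and the surrounding conversation into a single natural-language prompt rather than using separate context encoders. As a minimal sketch of that idea, the template below assembles a rule, prior turns, and a target comment into one prompt; the template wording and function name are illustrative assumptions, not the papers' actual prompts.

```python
# Minimal sketch of context-aware prompt construction for norm
# violation detection. The template wording is an assumption made
# for illustration, not CPL-NoViD's actual prompt format.
def build_prompt(rule: str, context_turns: list[str], target: str) -> str:
    """Fold the community rule and conversation context into one prompt."""
    context = "\n".join(f"[{i}] {turn}" for i, turn in enumerate(context_turns, 1))
    return (
        f"Community rule: {rule}\n"
        f"Conversation so far:\n{context}\n"
        f"New comment: {target}\n"
        "Does the new comment violate the rule? Answer yes or no."
    )


prompt = build_prompt(
    rule="No personal attacks.",
    context_turns=["A: I think the patch is wrong.", "B: Why?"],
    target="A: Because you clearly can't read.",
)
print(prompt)
```

Keeping the context as ordinary prose lets the same pretrained model handle arbitrarily long or short conversations without task-specific architecture changes, which is the design point these prompt-based approaches exploit.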
This list is automatically generated from the titles and abstracts of the papers on this site.