Related papers: Red Lines and Grey Zones in the Fog of War: Benchmarking Legal Risk, Moral Harm, and Regional Bias in Large Language Model Military Decision-Making

Red Lines and Grey Zones in the Fog of War: Benchmarking Legal Risk, Moral Harm, and Regional Bias in Large Language Model Military Decision-Making

URL: http://arxiv.org/abs/2510.03514v1
Date: Fri, 03 Oct 2025 20:55:04 GMT
Title: Red Lines and Grey Zones in the Fog of War: Benchmarking Legal Risk, Moral Harm, and Regional Bias in Large Language Model Military Decision-Making
Authors: Toby Drinkall,
Abstract summary: This study develops a benchmarking framework for evaluating aspects of legal and moral risk in targeting behaviour.<n>We introduce four metrics grounded in International Humanitarian Law (IHL) and military doctrine.<n>We evaluate three frontier models, GPT-4o, Gemini-2.5, and LLaMA-3.1, through 90 multi-agent, multi-turn crisis simulations.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As military organisations consider integrating large language models (LLMs) into command and control (C2) systems for planning and decision support, understanding their behavioural tendencies is critical. This study develops a benchmarking framework for evaluating aspects of legal and moral risk in targeting behaviour by comparing LLMs acting as agents in multi-turn simulated conflict. We introduce four metrics grounded in International Humanitarian Law (IHL) and military doctrine: Civilian Target Rate (CTR) and Dual-use Target Rate (DTR) assess compliance with legal targeting principles, while Mean and Max Simulated Non-combatant Casualty Value (SNCV) quantify tolerance for civilian harm. We evaluate three frontier models, GPT-4o, Gemini-2.5, and LLaMA-3.1, through 90 multi-agent, multi-turn crisis simulations across three geographic regions. Our findings reveal that off-the-shelf LLMs exhibit concerning and unpredictable targeting behaviour in simulated conflict environments. All models violated the IHL principle of distinction by targeting civilian objects, with breach rates ranging from 16.7% to 66.7%. Harm tolerance escalated through crisis simulations with MeanSNCV increasing from 16.5 in early turns to 27.7 in late turns. Significant inter-model variation emerged: LLaMA-3.1 selected an average of 3.47 civilian strikes per simulation with MeanSNCV of 28.4, while Gemini-2.5 selected 0.90 civilian strikes with MeanSNCV of 17.6. These differences indicate that model selection for deployment constitutes a choice about acceptable legal and moral risk profiles in military operations. This work seeks to provide a proof-of-concept of potential behavioural risks that could emerge from the use of LLMs in Decision Support Systems (AI DSS) as well as a reproducible benchmarking framework with interpretable metrics for standardising pre-deployment testing.

Related papers

LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations [2.430361444826172]
Large language models (LLMs) are increasingly proposed as agents in strategic decision environments.<n>We evaluate six popular state-of-the-art LLMs alongside results from human results across four real-world crisis simulation scenarios.<n>We compare models to humans in action alignment, risk calibration through chosen actions' severity, and argumentative framing grounded in international relations theory.
arXiv Detail & Related papers (2026-03-02T17:46:17Z)
A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents [4.851169906977996]
We introduce a new benchmark comprising 40 distinct scenarios.<n>Each scenario presents a task that requires multi-step actions, and the agent's performance is tied to a specific Key Performance Indicator (KPI)<n>We observe outcome-driven constraint violations ranging from 1.3% to 71.4%, with 9 of the 12 models exhibiting misalignment rates between 30% and 50%.
arXiv Detail & Related papers (2025-12-23T21:52:53Z)
Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts [54.15982476754607]
Large language models (LLMs) are now deployed at unprecedented scale, assisting millions of users in daily tasks.<n>This study defines complicit facilitation as the provision of guidance or support that enables illicit user instructions.<n>Using real-world legal cases and established legal frameworks, we construct an evaluation benchmark spanning 269 illicit scenarios and 50 illicit intents.
arXiv Detail & Related papers (2025-11-25T16:01:31Z)
Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning [70.56067503630486]
We argue that sixth-generation (6G) intelligence is not fluent token prediction but calibrated the capacity to imagine and choose.<n>We show that WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference.
arXiv Detail & Related papers (2025-11-04T17:22:22Z)
Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare.<n>Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns.<n>We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment [52.374772443536045]
HALF (Harm-Aware LLM Fairness) is a framework that assesses model bias in realistic applications and weighs the outcomes by harm severity.<n>We show that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
arXiv Detail & Related papers (2025-10-14T07:13:26Z)
SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models [66.71948519280669]
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive crossmodal reasoning but often amplify safety risks under adversarial prompts.<n> Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models to implicit risks.<n>We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
arXiv Detail & Related papers (2025-10-08T10:39:12Z)
CCD-Bench: Probing Cultural Conflict in Large Language Model Decision-Making [0.9310318514564272]
Large language models can navigate explicit conflicts between legitimately different cultural value systems.<n>CCD-Bench is a benchmark that assesses decision-making under cross-cultural value conflict.<n>CCD-Bench shifts evaluation beyond isolated bias detection toward pluralistic decision making.
arXiv Detail & Related papers (2025-10-03T22:55:37Z)
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.71684530652942]
We show that LLaVA-Critic-R1 emerges as a top-performing critic but also as a competitive policy model.<n>Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks.<n>Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation.
arXiv Detail & Related papers (2025-08-31T03:08:02Z)
Pre-Trained Policy Discriminators are General Reward Models [81.3974586561645]
We propose a scalable pre-training method named Policy Discriminative Learning (POLAR)<n>POLAR trains a reward model (RM) to discern identical policies and discriminate different ones.<n> Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods.
arXiv Detail & Related papers (2025-07-07T16:56:31Z)
Adversarial Preference Learning for Robust LLM Alignment [24.217309343426297]
Adversarial Preference Learning (APL) is an iterative adversarial training method incorporating three key innovations.<n>First, a direct harmfulness metric based on the model's intrinsic preference probabilities.<n>Second, a conditional generative attacker that synthesizes input-specific adversarial variations.
arXiv Detail & Related papers (2025-05-30T09:02:07Z)
Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models [2.11457423143017]
This study presents a novel benchmark designed to evaluate the biases and preferences of seven prominent foundation models.<n>We used 400-expert crafted scenarios to analyze results from our selected models.<n>All models exhibit some degree of country-specific biases, often recommending less escalatory and interventionist actions for China and Russia.
arXiv Detail & Related papers (2025-03-08T16:19:13Z)
EgoNormia: Benchmarking Physical Social Norm Understanding [52.87904722234434]
EGONORMIA spans seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility.<n>Our work demonstrates that current state-of-the-art vision-language models (VLMs) lack robust grounded norm understanding, scoring a maximum of 54% on EGONORMIA and 65% on EGONORMIA-verified.
arXiv Detail & Related papers (2025-02-27T19:54:16Z)
Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
We conduct a rigorous evaluation of large language models' implicit bias towards certain demographics.<n>Inspired by psychometric principles, we propose three attack approaches, i.e., Disguise, Deception, and Teaching.<n>Our methods can elicit LLMs' inner bias more effectively than competitive baselines.
arXiv Detail & Related papers (2024-06-20T06:42:08Z)
Distortions in Judged Spatial Relations in Large Language Models [45.875801135769585]
GPT-4 exhibited superior performance with 55 percent accuracy, followed by GPT-3.5 at 47 percent, and Llama-2 at 45 percent. The models identified the nearest cardinal direction in most cases, reflecting their associative learning mechanism.
arXiv Detail & Related papers (2024-01-08T20:08:04Z)
Escalation Risks from Language Models in Military and Diplomatic Decision-Making [0.0]
This work aims to scrutinize the behavior of multiple AI agents in simulated wargames. We design a novel wargame simulation and scoring framework to assess the risks of the escalation of actions taken by these agents. We observe that models tend to develop arms-race dynamics, leading to greater conflict, and in rare cases, even to the deployment of nuclear weapons.
arXiv Detail & Related papers (2024-01-07T07:59:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.