Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
- URL: http://arxiv.org/abs/2602.16346v2
- Date: Thu, 19 Feb 2026 10:44:43 GMT
- Title: Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
- Authors: Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut,
- Abstract summary: STING (Sequential Testing of Illicit N-step Goal execution) is an automated red-teaming framework. It constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups. We introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM-based agents execute real-world workflows via tools and memory. These same affordances enable ill-intentioned adversaries to use such agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
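The time-to-first-jailbreak framing in the abstract mirrors standard survival analysis: the turn index at which a scenario is first jailbroken is the event time, never-jailbroken scenarios are censored at the turn budget, and the "discovery curve" is the cumulative fraction of scenarios compromised by each turn. A minimal sketch of that framing, assuming a simple empirical estimator (function names and toy data here are illustrative, not from the paper):

```python
# Hypothetical sketch of the time-to-first-jailbreak analysis framing.
# first_jailbreak_turns: per-scenario turn index of the first successful
# jailbreak, or None if the scenario was never jailbroken (censored).

def discovery_curve(first_jailbreak_turns, max_turns):
    """Empirical fraction of scenarios jailbroken by each turn 1..max_turns."""
    n = len(first_jailbreak_turns)
    return [
        sum(1 for t in first_jailbreak_turns if t is not None and t <= turn) / n
        for turn in range(1, max_turns + 1)
    ]

def restricted_mean_discovery(first_jailbreak_turns, max_turns):
    """Area under the discovery curve up to the turn budget, normalized to
    [0, 1]: higher means jailbreaks are found earlier and more often."""
    curve = discovery_curve(first_jailbreak_turns, max_turns)
    return sum(curve) / max_turns

turns = [1, 2, 2, None, 5]          # toy data: 4 of 5 scenarios jailbroken
curve = discovery_curve(turns, 5)   # [0.2, 0.6, 0.6, 0.6, 0.8]
rm = restricted_mean_discovery(turns, 5)
```

This is only the simplest empirical estimator; the paper's actual Restricted Mean Jailbreak Discovery metric and hazard-ratio attribution may be defined differently.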
Related papers
- AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification [25.817251923574286]
We propose a novel inference-time detection and mitigation framework for large language model (LLM) agents. AgentSentry is the first inference-time defense to model multi-turn IPI as a temporal causal takeover. We evaluate AgentSentry on the AgentDojo benchmark across four task suites, three IPI attack families, and multiple black-box LLMs.
arXiv Detail & Related papers (2026-02-26T07:59:10Z) - Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents [68.20752678837377]
We propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk, the first benchmark to evaluate multi-turn tool-using agent safety. We propose ToolShield, a training-free, tool-agnostic, self-exploration defense.
arXiv Detail & Related papers (2026-02-13T18:38:18Z) - TRACER: Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning [4.928838343487574]
Existing uncertainty proxies focus on single-shot text generation. We introduce TRACER, a trajectory-level uncertainty metric for dual-control Tool-Agent-User interaction.
arXiv Detail & Related papers (2026-02-11T22:23:56Z) - ComAgent: Multi-LLM based Agentic AI Empowered Intelligent Wireless Networks [62.031889234230725]
6G networks rely on complex cross-layer optimization. Manually translating high-level intents into mathematical formulations remains a bottleneck. We present ComAgent, a multi-LLM agentic AI framework.
arXiv Detail & Related papers (2026-01-27T13:43:59Z) - Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks [22.908904483320953]
Large Language Models (LLMs) in coding tasks are often a reflection of their extensive pre-training corpora. We propose ILA-agent, a general ILA framework that equips LLMs with a set of behavioral primitives. We instantiate ILA-agent for Cangjie and evaluate its performance across code generation, translation, and program repair tasks.
arXiv Detail & Related papers (2026-01-16T09:06:47Z) - Multi-Agent Tool-Integrated Policy Optimization [67.12841355267678]
Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. No existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks.
arXiv Detail & Related papers (2025-10-06T10:44:04Z) - DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge. Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance. MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z) - Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments [54.67512489842682]
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior of LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z) - MAPS: A Multilingual Benchmark for Global Agent Performance and Security [8.275240552134338]
We propose MAPS, a benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. We translate each dataset into eleven diverse languages, resulting in 805 unique tasks and 9,660 total language-specific instances. We observe degradation in both performance and security when transitioning from English to other languages.
arXiv Detail & Related papers (2025-05-21T18:42:00Z) - Multi-lingual Multi-turn Automated Red Teaming for LLMs [4.707861373629172]
Multi-lingual Multi-turn Automated Red Teaming (MM-ART) is a method to fully automate conversational, multi-lingual red-teaming operations. We show the studied LLMs are on average 71% more vulnerable after a 5-turn conversation in English than after the initial turn. For conversations in non-English languages, models display up to 195% more safety vulnerabilities than the standard single-turn English approach.
arXiv Detail & Related papers (2025-04-04T05:06:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.