Related papers: ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation

URL: http://arxiv.org/abs/2602.01709v2
Date: Tue, 03 Feb 2026 03:19:49 GMT
Title: ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation
Authors: Xingshan Zeng, Lingzhi Wang, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu
Abstract summary: ARTIS, Agentic Risk-Aware Test-Time Scaling via Iterative Simulation, is a framework that decouples exploration from commitment. We show that naive LLM-based simulators struggle to capture rare but high-impact failure modes. We introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions.
Score: 72.78362530982109
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with external environments and their effects can be irreversible and costly. We propose ARTIS, Agentic Risk-Aware Test-Time Scaling via Iterative Simulation, a framework that decouples exploration from commitment by enabling test-time exploration through simulated interactions prior to real-world execution. This design allows extending inference-time computation to improve action-level reliability and robustness without incurring environmental risk. We further show that naive LLM-based simulators struggle to capture rare but high-impact failure modes, substantially limiting their effectiveness for agentic decision making. To address this limitation, we introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training. Experiments on multi-turn and multi-step agentic benchmarks demonstrate that iterative simulation substantially improves agent reliability, and that risk-aware simulation is essential for consistently realizing these gains across models and tasks.
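The abstract describes exploring actions in simulation before committing one to the real environment. A minimal sketch of that idea follows, under stated assumptions: `select_action`, `toy_simulator`, the worst-case scoring rule, and the `min_score` threshold are all hypothetical illustrations and are not taken from the paper, which does not specify its selection rule in the abstract.

```python
from typing import Callable, Optional, Sequence


def select_action(
    candidates: Sequence[str],
    simulate: Callable[[str], float],
    rollouts: int = 3,
    min_score: float = 0.0,
) -> Optional[str]:
    """Explore candidates in a simulator and return the one to commit, if any.

    Hypothetical rule: rank each action by its worst simulated outcome,
    so a single simulated failure is enough to demote an action.
    """
    best_action: Optional[str] = None
    best_score = float("-inf")
    for action in candidates:
        # Exploration touches only the simulator, never the real environment.
        worst = min(simulate(action) for _ in range(rollouts))
        if worst > best_score:
            best_action, best_score = action, worst
    # Commit only if even the pessimistic estimate clears the threshold.
    return best_action if best_score >= min_score else None


def toy_simulator(action: str) -> float:
    """Stand-in for a learned tool simulator (assumption: returns a scalar score)."""
    outcomes = {"delete_all_records": -10.0, "archive_records": 1.0, "noop": 0.0}
    return outcomes.get(action, -1.0)


chosen = select_action(
    ["delete_all_records", "archive_records", "noop"], toy_simulator
)
```

Ranking by the worst rollout is only one conservative choice. It is used here because the abstract stresses rare but high-impact failures, which an average over rollouts could hide.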

Related papers

AgentCyTE: Leveraging Agentic AI to Generate Cybersecurity Training & Experimentation Scenarios [0.19999259391104388]
We present AgentCyTE, a framework integrating large language models with deterministic, schema-constrained network emulation. AgentCyTE observes scenario outcomes, validates correctness, and iteratively enhances realism and consistency.
arXiv Detail & Related papers (2025-10-29T05:44:12Z)
UF-RNN: Real-Time Adaptive Motion Generation Using Uncertainty-Driven Foresight Prediction [4.849928323880955]
Training robots to operate effectively in environments with uncertain states remains a longstanding challenge in robotics. We propose the Uncertainty-driven Foresight Recurrent Neural Network (UF-RNN), a model that combines standard time-series prediction with an active "Foresight" module. UF-RNN exhibits robust adaptation by leveraging self-induced chaotic dynamics in its latent space. These findings suggest that integrating uncertainty-driven foresight into imitation learning pipelines can significantly enhance a robot's ability to handle unpredictable real-world conditions.
arXiv Detail & Related papers (2025-10-11T13:44:20Z)
AL-SPCE -- Reliability analysis for nondeterministic models using stochastic polynomial chaos expansions and active learning [0.0]
Many real-world systems display intrinsic randomness, requiring simulators whose outputs are random variables. While Monte Carlo methods can handle this, their high computational cost is often prohibitive. This work introduces an active learning framework to further reduce the computational burden of reliability analysis using emulators.
arXiv Detail & Related papers (2025-07-06T22:07:57Z)
Active Test-time Vision-Language Navigation [60.69722522420299]
ATENA is a test-time active learning framework that enables a practical human-robot interaction via episodic feedback on uncertain navigation outcomes. In particular, ATENA learns to increase certainty in successful episodes and decrease it in failed ones, improving uncertainty calibration. In addition, we propose a self-active learning strategy that enables an agent to evaluate its navigation outcomes based on confident predictions.
arXiv Detail & Related papers (2025-06-07T02:24:44Z)
RIFT: Group-Relative RL Fine-Tuning for Realistic and Controllable Traffic Simulation [13.319344167881383]
We introduce a dual-stage AV-centric simulation framework that conducts imitation learning pre-training in a data-driven simulator. We then learn fine-tuning in a physics-based simulator to enhance style-level controllability. In the fine-tuning stage, we propose RIFT, a novel group-relative RL fine-tuning strategy.
arXiv Detail & Related papers (2025-05-06T09:12:37Z)
RADE: Learning Risk-Adjustable Driving Environment via Multi-Agent Conditional Diffusion [17.46462636610847]
Risk-Adjustable Driving Environment (RADE) is a simulation framework that generates statistically realistic and risk-adjustable traffic scenes. RADE learns risk-conditioned behaviors directly from data, preserving naturalistic multi-agent interactions with controllable risk levels. We validate RADE on the real-world rounD dataset, demonstrating that it preserves statistical realism across varying risk levels.
arXiv Detail & Related papers (2025-05-06T04:41:20Z)
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [86.76714527437383]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z)
MIBP-Cert: Certified Training against Data Perturbations with Mixed-Integer Bilinear Programs [50.41998220099097]
Data errors, corruptions, and poisoning attacks during training pose a major threat to the reliability of modern AI systems. We introduce MIBP-Cert, a novel certification method based on mixed-integer bilinear programming (MIBP). By computing the set of parameters reachable through perturbed or manipulated data, we can predict all possible outcomes and guarantee robustness.
arXiv Detail & Related papers (2024-12-13T14:56:39Z)
Active Sequential Posterior Estimation for Sample-Efficient Simulation-Based Inference [12.019504660711231]
We introduce active sequential neural posterior estimation (ASNPE). ASNPE brings an active learning scheme into the inference loop to estimate the utility of simulation parameter candidates to the underlying probabilistic model. Our method outperforms well-tuned benchmarks and state-of-the-art posterior estimation methods on a large-scale real-world traffic network.
arXiv Detail & Related papers (2024-12-07T08:57:26Z)
GraphSCENE: On-Demand Critical Scenario Generation for Autonomous Vehicles in Simulation [11.896059467313668]
This work introduces a novel method that generates dynamic temporal scene graphs corresponding to diverse traffic scenarios, on-demand, tailored to user-defined preferences. A temporal Graph Neural Network (GNN) model learns to predict relationships between ego-vehicle agents and static structures, guided by real-world interaction patterns. We render the predicted scenarios in simulation to further demonstrate their effectiveness as testing environments for AV agents.
arXiv Detail & Related papers (2024-10-17T13:02:06Z)
SAFE-SIM: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries [94.84458417662407]
We introduce SAFE-SIM, a controllable closed-loop safety-critical simulation framework. Our approach yields two distinct advantages: 1) generating realistic long-tail safety-critical scenarios that closely reflect real-world conditions, and 2) providing controllable adversarial behavior for more comprehensive and interactive evaluations. We validate our framework empirically using the nuScenes and nuPlan datasets across multiple planners, demonstrating improvements in both realism and controllability.
arXiv Detail & Related papers (2023-12-31T04:14:43Z)
INTAGS: Interactive Agent-Guided Simulation [4.04638613278729]
In many applications involving multi-agent systems (MAS), it is imperative to test an experimental (Exp) autonomous agent in a high-fidelity simulator prior to its deployment to production. We propose a metric to distinguish between real and synthetic multi-agent systems, which is evaluated through the live interaction between the Exp and BG agents. We show that using INTAGS to calibrate the simulator can generate more realistic market data compared to the state-of-the-art conditional Wasserstein Generative Adversarial Network approach.
arXiv Detail & Related papers (2023-09-04T19:56:18Z)
REX: Rapid Exploration and eXploitation for AI Agents [103.68453326880456]
We propose an enhanced approach for Rapid Exploration and eXploitation for AI Agents called REX. REX introduces an additional layer of rewards and integrates concepts similar to Upper Confidence Bound (UCB) scores, leading to more robust and efficient AI agent performance.
arXiv Detail & Related papers (2023-07-18T04:26:33Z)
DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior without access to the control signals generated by the demonstrator. Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms. This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk. We propose a more data-efficient IfO algorithm
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
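The REX entry above mentions concepts similar to Upper Confidence Bound (UCB) scores. For readers unfamiliar with the idea, here is a minimal sketch of the textbook UCB1 rule, not REX's own scoring, which the snippet does not detail. The names `ucb1_score` and `pick_action` and the toy statistics are hypothetical.

```python
import math
from typing import Dict, Tuple


def ucb1_score(
    mean_reward: float,
    visits: int,
    total_visits: int,
    c: float = math.sqrt(2),
) -> float:
    """Textbook UCB1: mean reward plus an exploration bonus that shrinks with visits."""
    if visits == 0:
        # Untried actions get priority so each is sampled at least once.
        return float("inf")
    return mean_reward + c * math.sqrt(math.log(total_visits) / visits)


def pick_action(stats: Dict[str, Tuple[float, int]]) -> str:
    """Pick the action with the highest UCB1 score.

    stats maps an action name to (mean_reward, visit_count).
    """
    total = sum(visits for _, visits in stats.values())
    return max(stats, key=lambda a: ucb1_score(stats[a][0], stats[a][1], total))


# Toy statistics (made up for illustration only).
untried_first = pick_action({"a": (0.5, 10), "b": (0.4, 0)})
exploit = pick_action({"a": (0.9, 100), "b": (0.1, 100)})
```

With equal visit counts the bonus term is identical for both actions, so the choice falls back to the higher mean reward.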

This list is automatically generated from the titles and abstracts of the papers on this site.