Related papers: Failure Prediction at Runtime for Generative Robot Policies

URL: http://arxiv.org/abs/2510.09459v2
Date: Mon, 13 Oct 2025 13:29:31 GMT
Title: Failure Prediction at Runtime for Generative Robot Policies
Authors: Ralf Römer, Adrian Kobras, Luca Worbis, Angela P. Schoellig
Abstract summary: Early failure prediction during runtime is essential for deploying robots in human-centered and safety-critical environments. We propose FIPER, a framework for failure prediction for generative robot policies that does not require failure data. Our results demonstrate that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing methods.
Score: 6.375597233389154
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Imitation learning (IL) with generative models, such as diffusion and flow matching, has enabled robots to perform complex, long-horizon tasks. However, distribution shifts from unseen environments or compounding action errors can still cause unpredictable and unsafe behavior, leading to task failure. Early failure prediction during runtime is therefore essential for deploying robots in human-centered and safety-critical environments. We propose FIPER, a general framework for Failure Prediction at Runtime for generative IL policies that does not require failure data. FIPER identifies two key indicators of impending failure: (i) out-of-distribution (OOD) observations detected via random network distillation in the policy's embedding space, and (ii) high uncertainty in generated actions measured by a novel action-chunk entropy score. Both failure prediction scores are calibrated using a small set of successful rollouts via conformal prediction. A failure alarm is triggered when both indicators, aggregated over short time windows, exceed their thresholds. We evaluate FIPER across five simulation and real-world environments involving diverse failure modes. Our results demonstrate that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing methods. We thus consider this work an important step towards more interpretable and safer generative robot policies. Code, data and videos are available at https://tum-lsy.github.io/fiper_website.
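The calibration and alarm logic described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the two per-timestep scores (OOD and action-chunk entropy) are already computed, that windows are aggregated by a trailing mean, and that each threshold is a standard split-conformal quantile over one score per successful rollout. All function names and parameters (`conformal_threshold`, `window_mean`, `first_alarm`, `alpha`, `window`) are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def conformal_threshold(calibration_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Assumption: calibration_scores holds one scalar per successful rollout,
    e.g. the largest windowed score observed in that rollout.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration rollouts for this alpha: never raise an alarm.
        return math.inf
    return sorted(calibration_scores)[k - 1]


def window_mean(scores: Sequence[float], window: int) -> List[float]:
    """Trailing mean over the last `window` timesteps (shorter at the start)."""
    out = []
    for t in range(len(scores)):
        chunk = scores[max(0, t - window + 1): t + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def first_alarm(
    ood_scores: Sequence[float],
    entropy_scores: Sequence[float],
    ood_threshold: float,
    entropy_threshold: float,
    window: int,
) -> Optional[int]:
    """Return the first timestep at which BOTH windowed scores exceed their
    thresholds, or None if no alarm is raised."""
    ood = window_mean(ood_scores, window)
    ent = window_mean(entropy_scores, window)
    for t, (o, e) in enumerate(zip(ood, ent)):
        if o > ood_threshold and e > entropy_threshold:
            return t
    return None
```

Requiring both indicators to fire is what lets a scheme like this ignore a benign OOD situation: if the OOD score is high but the action entropy stays low, `first_alarm` returns `None`.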

Related papers

Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models [53.20969621498248]
We propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. We construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection.
arXiv Detail & Related papers (2025-12-01T17:57:27Z)
Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z)
Building a Foundational Guardrail for General Agentic Systems via Synthetic Data [76.18834864749606]
While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm. Existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. We introduce AuraGen, a controllable engine that synthesizes benign trajectories, injects category-labeled risks with difficulty, and filters outputs via an automated reward model.
arXiv Detail & Related papers (2025-10-10T18:42:32Z)
Revisiting Multivariate Time Series Forecasting with Missing Values [74.56971641937771]
Missing values are common in real-world time series. Current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. This framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle.
arXiv Detail & Related papers (2025-09-27T20:57:48Z)
Scene Graph-Guided Proactive Replanning for Failure-Resilient Embodied Agent [9.370683025542686]
We present a proactive replanning framework that detects and corrects failures at subtask boundaries. Experiments in the AI2-THOR simulator demonstrate that our approach detects semantic and spatial mismatches before execution failures occur.
arXiv Detail & Related papers (2025-08-15T07:48:51Z)
Can We Detect Failures Without Failure Data? Uncertainty-Aware Runtime Failure Detection for Imitation Learning Policies [19.27526590452503]
FAIL-Detect is a two-stage approach for failure detection in imitation learning-based robotic manipulation. We first distill policy inputs and outputs into scalar signals that correlate with policy failures and capture uncertainty. Our experiments show that the learned signals are mostly consistently effective, particularly when using our novel flow-based density estimator.
arXiv Detail & Related papers (2025-03-11T15:47:12Z)
Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress [31.952925824381325]
We propose a runtime monitoring framework that splits the detection of failures into two complementary categories. We use Vision Language Models (VLMs) to detect when the policy confidently and consistently takes actions that do not solve the task. By unifying temporal consistency detection and VLM runtime monitoring, Sentinel detects 18% more failures than using either of the two detectors alone.
arXiv Detail & Related papers (2024-10-06T22:13:30Z)
Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling [51.38330727868982]
We show how action chunking impacts the divergence between a learner and a demonstrator. We propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop adaptation. Our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z)
Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection [58.789823426981044]
We propose a novel auxiliary loss formulation that aims to align the class confidence of bounding boxes with the accurateness of predictions. Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in and out-domain scenarios.
arXiv Detail & Related papers (2023-03-25T08:56:21Z)
Optimal decision making in robotic assembly and other trial-and-error tasks [1.0660480034605238]
We study a class of problems providing (1) low-entropy indicators of terminal success / failure, and (2) unreliable (high-entropy) data to predict the final outcome of an ongoing task. We derive a closed form solution that predicts makespan based on the confusion matrix of the failure predictor. This allows the robot to learn failure prediction in a production environment, and only adopt a preemptive policy when it actually saves time.
arXiv Detail & Related papers (2023-01-25T22:07:50Z)
Tracking the risk of a deployed model and detecting harmful distribution shifts [105.27463615756733]
In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially. We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate.
arXiv Detail & Related papers (2021-10-12T17:21:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.