Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning
- URL: http://arxiv.org/abs/2602.12405v1
- Date: Thu, 12 Feb 2026 20:55:36 GMT
- Title: Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning
- Authors: Carl Qi, Xiaojie Wang, Silong Yong, Stephen Sheng, Huitan Mao, Sriram Srinivasan, Manikantan Nambi, Amy Zhang, Yesh Dattatreya
- Abstract summary: We introduce ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. We formulate detection and reasoning as a multi-task self-refinement process. We show that ARMOR achieves state-of-the-art performance by improving over the previous approaches by up to 30% on failure detection rate.
- Score: 16.274791437311602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning about failures is crucial for building reliable and trustworthy robotic systems. Prior approaches either treat failure reasoning as a closed-set classification problem or assume access to ample human annotations. Failures in the real world are typically subtle, combinatorial, and difficult to enumerate, whereas rich reasoning labels are expensive to acquire. We address this problem by introducing ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. We formulate detection and reasoning as a multi-task self-refinement process, where the model iteratively predicts detection outcomes and natural language reasoning conditioned on past outputs. During training, ARMOR learns from heterogeneous supervision - large-scale sparse binary labels and small-scale rich reasoning annotations - optimized via a combination of offline and online imitation learning. At inference time, ARMOR generates multiple refinement trajectories and selects the most confident prediction via a self-certainty metric. Experiments across diverse environments show that ARMOR achieves state-of-the-art performance by improving over the previous approaches by up to 30% on failure detection rate and up to 100% in reasoning measured through LLM fuzzy match score, demonstrating robustness to heterogeneous supervision and open-ended reasoning beyond predefined failure modes. We provide additional visualizations on our website: https://sites.google.com/utexas.edu/armor
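As a rough illustration of the inference procedure described in the abstract, the sketch below generates several round-based refinement trajectories and keeps the most confident one. The `model.predict` interface, the round and trajectory counts, and the use of mean token log-probability as the self-certainty metric are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class RoundOutput:
    failure_detected: bool   # binary detection outcome for this round
    reasoning: str           # natural-language failure explanation
    token_logprobs: list     # per-token log-probabilities of the generated text

def self_certainty(trajectory):
    # Assumed metric: mean token log-probability of the final round's output.
    last = trajectory[-1]
    return sum(last.token_logprobs) / max(len(last.token_logprobs), 1)

def refine(model, observation, num_rounds=3):
    # Each round conditions on the history of past detection/reasoning
    # outputs, so later rounds can revise earlier mistakes.
    history = []
    for _ in range(num_rounds):
        history.append(model.predict(observation, history))
    return history

def armor_style_inference(model, observation, num_trajectories=5):
    # Sample several refinement trajectories; keep the most self-certain one.
    trajectories = [refine(model, observation) for _ in range(num_trajectories)]
    best = max(trajectories, key=self_certainty)
    return best[-1]  # final-round prediction of the selected trajectory
```

The key design point is that refinement and selection are decoupled: each trajectory self-refines by conditioning on its own past outputs, and the confidence metric only arbitrates between finished trajectories.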
Related papers
- Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization [56.59356959631999]
Gated Perception-Reasoning Optimization (GPRO) is a meta-reasoning controller that dynamically routes computation among three decision paths. GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods.
arXiv Detail & Related papers (2026-01-07T23:05:17Z)
- Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models [53.20969621498248]
We propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. We construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection.
arXiv Detail & Related papers (2025-12-01T17:57:27Z)
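A minimal sketch of the failure-synthesis idea, assuming trajectories are lists of waypoint coordinates; the perturbation modes and magnitudes below are invented for illustration and are not the paper's actual procedure:

```python
import random

def perturb_trajectory(waypoints, mode="offset", seed=None):
    # Turn one successful trajectory (a list of [x, y, z] waypoints)
    # into a labeled failure example.
    rng = random.Random(seed)
    traj = [list(p) for p in waypoints]
    if mode == "offset":
        # Execution-style failure: spatially offset one waypoint (e.g., missed grasp).
        i = rng.randrange(len(traj))
        traj[i] = [c + rng.uniform(-0.05, 0.05) for c in traj[i]]
    elif mode == "skip":
        # Planning-style failure: drop an intermediate step (e.g., skipped pre-grasp).
        if len(traj) > 2:
            del traj[rng.randrange(1, len(traj) - 1)]
    return traj, {"label": "failure", "perturbation": mode}
```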
- Obstruction reasoning for robotic grasping [18.39507400925748]
We present UNOGrasp, a learning-based vision-language model capable of performing obstruction reasoning. We devise a novel multi-step reasoning process based on obstruction paths originating from the target object. We construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2.
arXiv Detail & Related papers (2025-11-28T13:53:12Z)
- VAR: Visual Attention Reasoning via Structured Search and Backtracking [49.427842994857635]
We introduce Visual Attention Reasoning (VAR), a framework that recasts grounded reasoning as a structured search. VAR decomposes the reasoning process into two key stages: traceable evidence grounding and search-based chain-of-thought. We show that our 7B model, VAR-7B, sets a new state-of-the-art on a comprehensive suite of hallucination and safety benchmarks.
arXiv Detail & Related papers (2025-10-21T13:18:44Z)
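The search-based chain-of-thought stage can be pictured as a depth-first search in which candidate reasoning steps must pass an evidence check before being expanded, and failing steps are pruned, which is what triggers backtracking. The `propose`, `verify`, and `is_answer` callbacks below are assumed interfaces, not VAR's actual API:

```python
def search_with_backtracking(state, propose, verify, is_answer,
                             depth=0, max_depth=6):
    # Depth-first search over candidate reasoning steps. Steps that fail
    # the evidence check are pruned, which triggers backtracking.
    if is_answer(state):
        return state
    if depth >= max_depth:
        return None
    for next_state in propose(state):
        if not verify(next_state):  # evidence-grounding gate
            continue                # prune this step, try a sibling
        found = search_with_backtracking(next_state, propose, verify,
                                         is_answer, depth + 1, max_depth)
        if found is not None:
            return found
    return None  # all candidates exhausted: backtrack to the caller
```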
- Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z)
- Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z)
- RationAnomaly: Log Anomaly Detection with Rationality via Chain-of-Thought and Reinforcement Learning [27.235259453535537]
RationAnomaly is a novel framework that enhances log anomaly detection by synergizing Chain-of-Thought fine-tuning with reinforcement learning. We have released the corresponding resources, including code and datasets.
arXiv Detail & Related papers (2025-09-18T07:35:58Z)
- TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos [48.126793563151715]
No existing technique effectively detects open-set procedural mistakes online. The proposed approach uses two branches: one continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module's output.
arXiv Detail & Related papers (2024-11-04T20:03:06Z)
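A compact sketch of this two-branch scheme, with `recognize` and `anticipate` as assumed stand-ins for the paper's recognition and anticipation modules; a mismatch between the observed and anticipated step is flagged as a mistake:

```python
def online_mistake_detection(frames, recognize, anticipate):
    # Branch 1 (`recognize`) labels the current step from the egocentric
    # video; branch 2 (`anticipate`) predicts the expected next step from
    # past recognitions. A disagreement flags an open-set mistake.
    history = []
    for frame in frames:
        observed = recognize(frame)
        expected = anticipate(history) if history else None
        mistake = expected is not None and observed != expected
        yield frame, observed, mistake
        history.append(observed)
```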
- State Machine of Thoughts: Leveraging Past Reasoning Trajectories for Enhancing Problem Solving [6.198707341858042]
We use a state machine to record experience derived from previous reasoning trajectories.
Within the state machine, states represent decomposed sub-problems, while state transitions reflect the dependencies among sub-problems.
Our proposed State Machine of Thoughts (SMoT) selects the optimal sub-solutions and avoids incorrect ones.
arXiv Detail & Related papers (2023-12-29T03:00:04Z)
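A toy sketch of such a state machine, assuming sub-solutions are scored by counting past successes and failures; the class interface and scoring rule are invented for illustration:

```python
from collections import defaultdict

class ThoughtStateMachine:
    # States are decomposed sub-problems; transitions record which
    # sub-problems depend on which; each state keeps scored sub-solutions.
    def __init__(self):
        self.transitions = defaultdict(set)  # sub-problem -> dependent sub-problems
        self.solutions = defaultdict(dict)   # sub-problem -> {solution: score}

    def add_dependency(self, subproblem, next_subproblem):
        self.transitions[subproblem].add(next_subproblem)

    def record(self, subproblem, solution, succeeded):
        # Reinforce sub-solutions that worked in past trajectories;
        # penalize those that failed.
        score = self.solutions[subproblem].get(solution, 0)
        self.solutions[subproblem][solution] = score + (1 if succeeded else -1)

    def best_solution(self, subproblem):
        # Prefer the highest-scoring sub-solution; skip net-negative ones.
        positive = {s: v for s, v in self.solutions[subproblem].items() if v > 0}
        return max(positive, key=positive.get) if positive else None
```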
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)