Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
Abstract Overview
Sentinel-VLA is a metacognitive vision-language-action (VLA) model that incorporates an active status monitor module to determine when reasoning (planning, subtask updates, or error recovery) is needed during robotic manipulation, rather than reasoning at every timestep. When the status is classified as "Normal," the model reuses its existing thought memory and outputs actions directly, bringing inference time down to 13 ms per action. The model is trained using EC-Gen, an automatic pipeline that synthesizes error-recovery trajectories and annotations, producing a dataset covering 44 RLBench tasks with over 2.6 million transitions. The paper also introduces Self-Evolving Continual Learning (SECL) with an Orthogonal Continual Adapter (OC-Adapter) to expand capabilities while mitigating catastrophic forgetting. Experiments are conducted on RLBench, LIBERO-LONG, and real-world manipulation tasks using a Piper robot arm.
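The on-demand reasoning described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: `run_episode`, `monitor`, `reason`, and `act` are hypothetical names, and the three callables stand in for the status monitor, the reasoning step, and the action head. The only behavior taken from the summary is that reasoning is skipped and the stored thought is reused whenever the status is "Normal".

```python
from enum import Enum
from typing import Any, Callable, List, Optional, Tuple


class Status(Enum):
    """The four execution states named in this summary."""
    INITIAL = "Initial"
    NORMAL = "Normal"
    NEW_SUBTASK = "New-subtask"
    ERROR = "Error"


def run_episode(
    observations: List[Any],
    monitor: Callable[[Any], Status],        # hypothetical status monitor
    reason: Callable[[Any, Status], Any],    # hypothetical (slow) reasoning step
    act: Callable[[Any, Any], Any],          # hypothetical (fast) action head
) -> Tuple[List[Any], int]:
    """Return the actions taken and the number of reasoning calls made."""
    thought: Optional[Any] = None  # thought memory reused across "Normal" steps
    actions: List[Any] = []
    reasoning_calls = 0
    for obs in observations:
        status = monitor(obs)
        # Reason only when the monitor reports a non-Normal status,
        # or when no thought has been stored yet.
        if status is not Status.NORMAL or thought is None:
            thought = reason(obs, status)
            reasoning_calls += 1
        actions.append(act(obs, thought))
    return actions, reasoning_calls
```

In this sketch the cost of `reason` is paid only on Initial, New-subtask, and Error steps, so an episode dominated by Normal steps runs mostly at the speed of `act`.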
Novelty
The paper's primary contribution is a unified VLA architecture that integrates an independent status monitor expert for on-demand reasoning and error recovery, replacing always-on chain-of-thought or external correction modules. It also introduces EC-Gen, an automatic pipeline for synthesizing error-recovery training data through stochastic perturbation of expert trajectories, and an Orthogonal Continual Adapter that constrains new adapter updates to be orthogonal to previously learned parameter spaces to prevent catastrophic forgetting.
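One common way to realize an orthogonality constraint of this kind is to project each new update onto the orthogonal complement of the subspace spanned by earlier directions. The sketch below shows that projection in plain Python on small vectors. It is an assumption about the general technique, not the OC-Adapter's actual formulation, and `orthonormal_basis` and `project_out` are hypothetical helper names.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: List[Vector], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update: Vector, basis: List[Vector]) -> Vector:
    """Remove from `update` every component lying in the span of `basis`."""
    w = list(update)
    for b in basis:
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w


# Directions assumed (for illustration) to encode previously learned tasks.
previous_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(previous_directions)

# A proposed new update; only its part orthogonal to the old span is kept.
constrained_update = project_out([2.0, 3.0, 4.0], basis)
```

Because the constrained update has no component along the earlier directions, applying it leaves the model's behavior along those directions unchanged, which is the intuition behind using orthogonality to limit forgetting.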
Results
Sentinel-VLA achieves 63.5% on RLBench seen tasks, 51.3% on unseen tasks, 90.7% on LIBERO-LONG, and 60.0% on real-world tasks, outperforming PI0 (57.8%, 42.0%, 85.2%, and 46.0%, respectively) in all four settings. The status monitor achieves a 97.4% error detection rate in simulation and 90.6% in real-world evaluation. Inference runs at 13 ms per action, substantially faster than CoT-based methods such as ECoT (1528 ms) and comparable to non-reasoning baselines.
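The margins implied by these figures can be worked out directly. The short Python snippet below uses only the numbers reported above; the variable names are illustrative.

```python
# Reported scores (percent), copied from the results above.
sentinel = {"RLBench seen": 63.5, "RLBench unseen": 51.3,
            "LIBERO-LONG": 90.7, "Real world": 60.0}
pi0 = {"RLBench seen": 57.8, "RLBench unseen": 42.0,
       "LIBERO-LONG": 85.2, "Real world": 46.0}

# Absolute gain over PI0 in percentage points, per setting.
margins = {name: round(sentinel[name] - pi0[name], 1) for name in sentinel}

# Latency ratio between ECoT (1528 ms) and Sentinel-VLA (13 ms) per action.
speedup_vs_ecot = 1528 / 13

for name, gain in margins.items():
    print(f"{name}: +{gain} points over PI0")
print(f"ECoT / Sentinel-VLA latency ratio: {speedup_vs_ecot:.1f}x")
```

The gains are 5.7, 9.3, 5.5, and 14.0 percentage points respectively, so the largest margins are on unseen and real-world tasks, and the latency ratio against ECoT is roughly 118 to 1.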
Key Points
- Sentinel-VLA uses a dedicated status monitor expert to classify execution states (Initial, Normal, New-subtask, Error) and triggers reasoning only when needed, achieving an inference time of 13 ms per action while retaining the ability to replan and recover from errors.
- The EC-Gen pipeline automatically generates large-scale error-recovery training data (2.6M+ transitions across 44 tasks) by injecting three types of perturbations (interaction, spatial, semantic) into expert trajectories and annotating recovery sequences.
- The SECL algorithm with OC-Adapter enables continual learning by constraining new adapter updates to an orthogonal space relative to existing knowledge, yielding 60.0% real-world success versus 44.7% when standard LoRA is used without the orthogonality constraint.
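To make the EC-Gen idea of perturbing expert trajectories more concrete, here is a toy Python sketch of a single spatial perturbation followed by a recovery segment. Everything in it is an assumption made for illustration: the function name `inject_spatial_error`, the offset range, the linear recovery path, and the labeling scheme are hypothetical and are not taken from the paper, which also covers interaction and semantic perturbations not shown here.

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def inject_spatial_error(
    expert: List[Point],
    rng: random.Random,
    max_offset: float = 0.05,   # assumed offset bound, in arbitrary units
    recovery_steps: int = 3,    # assumed length of the recovery segment
) -> List[Dict[str, object]]:
    """Displace one interior waypoint, then interpolate back to the expert path."""
    if len(expert) < 3:
        raise ValueError("need at least three waypoints to perturb an interior one")
    t = rng.randrange(1, len(expert) - 1)  # interior waypoint to perturb
    offset = [rng.uniform(-max_offset, max_offset) for _ in range(3)]
    perturbed = tuple(p + o for p, o in zip(expert[t], offset))

    out: List[Dict[str, object]] = []
    for i, pos in enumerate(expert[:t]):
        out.append({"pos": pos, "status": "Initial" if i == 0 else "Normal",
                    "recovery": False})
    # The displaced waypoint is the annotated error state.
    out.append({"pos": perturbed, "status": "Error", "recovery": False})
    # Recovery: move linearly from the displaced pose back to the expert waypoint.
    for k in range(1, recovery_steps + 1):
        a = k / recovery_steps
        pos = tuple(q + a * (e - q) for q, e in zip(perturbed, expert[t]))
        out.append({"pos": pos, "status": "Normal", "recovery": True})
    for pos in expert[t + 1:]:
        out.append({"pos": pos, "status": "Normal", "recovery": False})
    return out


expert_path = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.1), (0.3, 0.0, 0.2)]
augmented = inject_spatial_error(expert_path, random.Random(0))
```

Each call yields one trajectory with exactly one labeled error state and a short recovery segment that rejoins the expert path, which is the kind of supervision a status monitor and recovery policy would need.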