Related papers: Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models

Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models

URL: http://arxiv.org/abs/2603.05147v1
Date: Thu, 05 Mar 2026 13:14:41 GMT
Title: Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models
Authors: Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci,
Abstract summary: We propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state.<n>Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators.
Score: 7.802379200026965
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute known tasks immediately (Act), reason about ambiguous scenarios (Think), and preemptively halt execution when encountering significant physical or semantic anomalies (Abstain). In our empirical analysis, we observe a phenomenon where visual embeddings alone are superior for inferring task complexity due to the semantic invariance of language. Evaluated on the LIBERO and LIBERO-PRO benchmarks as well as on a real robot, our vision-only configuration achieves 80% F1-Score using as little as 5% of training data, establishing itself as a reliable and efficient task complexity detector.

Related papers

LIBERO-X: Robustness Litmus for Vision-Language-Action Models [32.29541801424534]
This work systematically rethinks VLA benchmarking from both evaluation and data perspectives.<n>We introduce LIBERO-X, a benchmark featuring a hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities.<n> Experiments with representative VLA models reveal significant performance drops under cumulative perturbations.
arXiv Detail & Related papers (2026-02-06T09:59:12Z)
SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models [21.133970394496327]
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control.<n>Current test-time scaling (TTS) methods require additional training, verifiers, and multiple forward passes, making them impractical for deployment.<n>We propose a simple inference strategy that jointly modulates visual perception and action based on'self-uncertainty'
arXiv Detail & Related papers (2026-02-04T04:48:16Z)
ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations.<n>Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations.<n>To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z)
VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension [51.76841625486355]
Referring Expression (REC) aims to localize the image region corresponding to a natural-language query.<n>Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning.<n>We introduce VIRO, a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps.
arXiv Detail & Related papers (2026-01-19T07:21:19Z)
PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention [92.85371254435074]
PosA-VLA framework anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions.<n>We show that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks.
arXiv Detail & Related papers (2025-12-03T12:14:29Z)
Learning Affordances at Inference-Time for Vision-Language-Action Models [50.93181349331096]
In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks.<n>We introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences.<n>Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution.
arXiv Detail & Related papers (2025-10-22T16:43:29Z)
Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks [38.058215007885096]
Self-evaluation for large language models (LLMs) incurs high computational overhead and introduces overconfidence issues due to intrinsic biases.<n>We propose a novel self-evaluation-free approach for unverifiable tasks, designed for lightweight yet effective self-improvement.
arXiv Detail & Related papers (2025-09-27T02:44:05Z)
Do What? Teaching Vision-Language-Action Models to Reject the Impossible [53.40183895299108]
Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks.<n>We propose Instruct-Verify-and-Act (IVA), a framework that detects when an instruction cannot be executed due to a false premise.<n>Our experiments show that IVA improves false premise detection accuracy by 97.56% over baselines.
arXiv Detail & Related papers (2025-08-22T10:54:33Z)
Self-Corrective Task Planning by Inverse Prompting with Large Language Models [9.283971287618261]
We introduce InversePrompt, a novel self-corrective task planning approach.<n>Our method incorporates reasoning steps to provide clear, interpretable feedback.<n>Results on benchmark datasets show an average 16.3% higher success rate over existing LLM-based task planning methods.
arXiv Detail & Related papers (2025-03-10T13:35:51Z)
Transparent and Coherent Procedural Mistake Detection [30.540514590818265]
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user has successfully executed a task (specified by a procedural text)<n>We extend PMD to require generating visual self-dialog rationales to inform decisions.<n>Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames.
arXiv Detail & Related papers (2024-12-16T16:13:55Z)
In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality.<n>To handle these challenges, a direct solution is to generate high-confidence'' data from unsupervised downstream tasks.<n>We propose a novel approach, pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.