When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
- URL: http://arxiv.org/abs/2602.17659v1
- Date: Thu, 19 Feb 2026 18:59:20 GMT
- Title: When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
- Authors: Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, Mingyu Ding,
- Abstract summary: Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. We show that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme.
- Score: 31.92520697946991
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training, regardless of language intent. To study this failure mode systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs, which evaluates language-following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate plug-and-play integration across diverse VLAs with consistent improvements. For example, on LIBERO-CF, CAG improves $\pi_{0.5}$ by 9.7% in language-following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
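The abstract describes CAG only at a high level, so the following is a minimal sketch of how a dual-branch, guidance-style action selection step could look. The function names (`vla_logprob`, `va_logprob`), the candidate-reranking setup, and the guidance rule are illustrative assumptions, not the paper's exact method:

```python
# Minimal sketch of CAG-style dual-branch action selection.
# Assumption: both branches can score an action chunk with a log-probability;
# the exact guidance rule used in the paper may differ.
import numpy as np

def cag_select(candidates, vla_logprob, va_logprob, weight=1.0):
    """Pick the candidate whose likelihood gains most from language.

    candidates  -- action chunks proposed by the VLA policy
    vla_logprob -- log p(action | image, instruction), language-conditioned branch
    va_logprob  -- log p(action | image), vision-only counterfactual branch
    weight      -- guidance strength; weight=0 recovers plain VLA ranking
    """
    scores = []
    for a in candidates:
        cond = vla_logprob(a)    # standard VLA branch
        uncond = va_logprob(a)   # language-unconditioned VA branch
        # Upweight actions explained by the instruction rather than by
        # visual shortcuts that the VA branch can already predict.
        scores.append(cond + weight * (cond - uncond))
    return candidates[int(np.argmax(scores))]
```

The contrast term `cond - uncond` is what makes the comparison counterfactual: an action that scores highly even without the instruction is likely a visual shortcut, and it is penalized accordingly.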
Related papers
- LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models [4.54067274409672]
Vision-Language-Action (VLA) models achieve over 95% success on standard benchmarks. We find that current state-of-the-art VLA models largely ignore language instructions. This paper constructs the LangGap benchmark, based on a four-dimensional semantic perturbation method.
arXiv Detail & Related papers (2026-02-28T10:53:33Z)
- Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment [58.93227458806748]
CoVer-VLA is a hierarchical test-time verification pipeline using a trained verifier. Our framework precomputes a diverse set of rephrased instructions from a Vision-Language Model. It repeatedly generates action candidates for each instruction, and then uses the verifier to select the optimal high-level prompt and low-level action chunks.
arXiv Detail & Related papers (2026-02-12T18:59:59Z)
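As a rough illustration of the verification loop described in that abstract: precompute rephrased prompts, sample candidate action chunks per prompt, and let the verifier pick. The `rephrase`, `policy`, and `verifier` interfaces below are hypothetical stand-ins, not CoVer-VLA's actual API:

```python
# Sketch of a CoVer-VLA-style test-time verification loop.
# Assumed interfaces: rephrase(instr, n) -> list[str],
# policy.sample(obs, prompt) -> action chunk,
# verifier(obs, instr, chunk) -> float score.
def verify_and_act(obs, instruction, rephrase, policy, verifier,
                   n_prompts=4, n_candidates=8):
    # High-level search: the original instruction plus VLM rephrasings.
    prompts = [instruction] + rephrase(instruction, n=n_prompts - 1)
    best_chunk, best_score = None, float("-inf")
    for prompt in prompts:
        for _ in range(n_candidates):          # low-level candidate search
            chunk = policy.sample(obs, prompt)
            score = verifier(obs, instruction, chunk)
            if score > best_score:
                best_chunk, best_score = chunk, score
    return best_chunk
```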
- ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z)
- EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models [57.75717492488268]
Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models. Supervised Finetuning (SFT) requires hundreds of demonstrations per task, rigidly memorizes trajectories, and fails to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations.
arXiv Detail & Related papers (2025-12-16T18:26:38Z)
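A minimal sketch of what such a test-time adaptation loop could look like, assuming the environment provides a success signal and the policy exposes `act` and `finetune` methods (both hypothetical interfaces; the paper's actual update rule may differ):

```python
# Sketch of an EVOLVE-VLA-style test-time training loop: roll out,
# keep episodes the environment scores as successful, fine-tune on them.
# The self-imitation update here is an illustrative assumption.
def rollout(policy, env, max_steps=200):
    obs, instr = env.reset()                   # assumed env interface
    traj = []
    for _ in range(max_steps):
        action = policy.act(obs, instr)
        traj.append((obs, instr, action))
        obs, done, success = env.step(action)  # environment feedback
        if done:
            return traj, success
    return traj, False

def evolve(policy, env, n_rounds=10, episodes_per_round=8):
    for _ in range(n_rounds):
        buffer = []
        for _ in range(episodes_per_round):
            traj, success = rollout(policy, env)
            if success:                        # keep only successful episodes
                buffer.extend(traj)
        if buffer:
            policy.finetune(buffer)            # behavior cloning on successes
    return policy
```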
- AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models [60.39655329875822]
Vision-Language-Action (VLA) models enable robots to interpret natural-language instructions and perform diverse tasks. Despite growing interest in attacking such models, the effectiveness of existing techniques remains unclear. We propose AttackVLA, a unified framework that aligns with the VLA development lifecycle.
arXiv Detail & Related papers (2025-11-15T10:30:46Z)
- Learning Affordances at Inference-Time for Vision-Language-Action Models [50.93181349331096]
In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks. We introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution.
arXiv Detail & Related papers (2025-10-22T16:43:29Z)
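The reason-execute-assess cycle described there could be organized roughly as below; the `vlm.plan`, `vla.execute`, and `vlm.assess` interfaces and the memory structure are hypothetical, not LITEN's actual design:

```python
# Sketch of a LITEN-style inference-time loop: a high-level VLM plans
# from past experience, the low-level VLA executes, and an assessment
# phase converts the outcome into feedback for the next iteration.
def liten_episode(task, vlm, vla, env, max_iters=5):
    memory = []                               # past plans and reflections
    for _ in range(max_iters):
        plan = vlm.plan(task, memory)         # reasoning phase
        outcomes = [vla.execute(step, env) for step in plan]
        feedback = vlm.assess(task, plan, outcomes)  # assessment phase
        memory.append((plan, feedback))
        if feedback.success:                  # stop once the task is solved
            break
    return memory
```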
- Do What? Teaching Vision-Language-Action Models to Reject the Impossible [53.40183895299108]
Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. We propose Instruct-Verify-and-Act (IVA), a framework that detects when an instruction cannot be executed due to a false premise. Our experiments show that IVA improves false premise detection accuracy by 97.56% over baselines.
arXiv Detail & Related papers (2025-08-22T10:54:33Z)
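Conceptually, such a framework wraps the policy with a premise check before acting. The `detector.premise_holds` and `policy.act` interfaces below are assumptions for illustration only:

```python
# Sketch of an IVA-style instruct-verify-and-act wrapper: verify that the
# instruction's premise holds in the scene (e.g. the referenced object
# exists) before executing; otherwise reject with an explanation.
def instruct_verify_act(obs, instruction, detector, policy):
    ok, reason = detector.premise_holds(obs, instruction)
    if not ok:
        # Refuse instead of acting on a false premise.
        return {"action": None, "response": f"Cannot comply: {reason}"}
    return {"action": policy.act(obs, instruction), "response": "ok"}
```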
- From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models [5.660635614478238]
Vision-Language-Action (VLA) models promise to produce versatile, "generalist" robot policies. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. We introduce a unified suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects.
arXiv Detail & Related papers (2025-06-11T16:52:18Z)
- BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization [45.97834622654751]
BadVLA is a backdoor attack method based on Objective-Decoupled Optimization. We show that BadVLA consistently achieves near-100% attack success rates with minimal impact on clean task accuracy. Our work offers the first systematic investigation of backdoor vulnerabilities in VLA models.
arXiv Detail & Related papers (2025-05-22T13:12:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.