ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance
- URL: http://arxiv.org/abs/2601.16667v1
- Date: Fri, 23 Jan 2026 11:31:07 GMT
- Title: ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance
- Authors: Zhuohao Li, Yinghao Li, Jian-Jian Jiang, Lang Zhou, Tianyu Zhang, Wei-Shi Zheng
- Abstract summary: We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
- Score: 50.05984919728878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models have advanced robotic manipulation by combining vision, language, and proprioception to predict actions. However, previous methods fuse proprioceptive signals directly with VLM-encoded vision-language features, resulting in state-dominant bias and false completions despite visible execution failures. We attribute this to modality imbalance, where policies over-rely on internal state while underusing visual evidence. To address this, we present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. The key insight is to introduce auxiliary task-aware environment priors to adaptively modulate the coupling between semantic perception and proprioceptive dynamics. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations, which drive a Vision-Proprioception Feature-wise Linear Modulation to enhance environmental awareness and reduce state-driven errors. Moreover, to evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop. Extensive experiments show that ReViP effectively reduces false-completion rates and improves success rates over strong VLA baselines on our suite, with gains extending to LIBERO, RoboTwin 2.0, and real-world evaluations.
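The abstract names a Vision-Proprioception Feature-wise Linear Modulation (FiLM) that uses task-centric visual cues from an external VLM to modulate proprioceptive features. The sketch below is a minimal, illustrative PyTorch rendering of that idea, not the authors' implementation: module names, dimensions, and the exact wiring (cues generating per-channel scale/shift applied to an encoded state stream) are assumptions for clarity.

```python
# Minimal sketch (assumed design, not the ReViP code): FiLM conditioning where
# VLM-derived task-centric visual cues gate proprioceptive features before fusion,
# so the policy's internal state stream is modulated by visual evidence.
import torch
import torch.nn as nn


class VisionProprioFiLM(nn.Module):
    def __init__(self, cue_dim: int, proprio_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Encode raw proprioception (joint angles, gripper state, ...) into features.
        self.proprio_encoder = nn.Sequential(
            nn.Linear(proprio_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Predict per-channel gamma (scale) and beta (shift) from the visual cue.
        self.film_generator = nn.Linear(cue_dim, 2 * hidden_dim)

    def forward(self, visual_cue: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        h = self.proprio_encoder(proprio)                  # (B, hidden_dim)
        gamma, beta = self.film_generator(visual_cue).chunk(2, dim=-1)
        return (1.0 + gamma) * h + beta                    # visually modulated state features


# Hypothetical usage: a pooled embedding from the task-stage observer VLM gates the state stream.
film = VisionProprioFiLM(cue_dim=512, proprio_dim=14)
cues = torch.randn(4, 512)       # assumed VLM-derived task-centric cue embedding
state = torch.randn(4, 14)       # assumed robot proprioception vector
fused_state = film(cues, state)  # (4, 256), ready to fuse with vision-language tokens
```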
Related papers
- VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation [22.921677603408188]
Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation. VAUQ explicitly measures how strongly a model's output depends on visual evidence.
arXiv Detail & Related papers (2026-02-24T16:11:14Z) - VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models [26.542479606920423]
Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite the success, extending large pretrained VLA models to the action space can induce vision-action misalignment. We propose a training framework that explicitly strengthens visual conditioning in VLA models.
arXiv Detail & Related papers (2026-02-04T20:59:29Z) - SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models [21.133970394496327]
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control. Current test-time scaling (TTS) methods require additional training, verifiers, and multiple forward passes, making them impractical for deployment. We propose a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty'.
arXiv Detail & Related papers (2026-02-04T04:48:16Z) - Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision [79.06371915084833]
We introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm. Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. We extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions.
arXiv Detail & Related papers (2026-01-27T17:01:16Z) - Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models [13.32858759983739]
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs. Existing inference-time interventions to mitigate this issue present a challenging trade-off. We present Residual-Update Directed DEcoding Regulation (RUDDER), a framework that steers LVLMs towards visually-grounded generation.
arXiv Detail & Related papers (2025-11-13T13:29:38Z) - Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization [42.41263928527529]
Vision-Language-Action (VLA) models can endow agents with transferable world knowledge and vision-language grounding. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original visual representations and knowledge are preserved. We conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations.
arXiv Detail & Related papers (2025-10-29T15:20:10Z) - ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model [61.29164681694533]
ViPER is a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability.
arXiv Detail & Related papers (2025-10-28T10:42:57Z) - ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models [34.60772103760521]
We introduce a novel framework that enhances Embodied Visual Tracking (EVT) with Vision-Language Models (VLMs). This work represents the first integration of VLM-based reasoning to assist EVT agents in proactive failure recovery.
arXiv Detail & Related papers (2025-05-27T04:53:50Z) - Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations [11.474045796965056]
Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH). We propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs.
arXiv Detail & Related papers (2025-05-23T12:29:00Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. We propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies.
arXiv Detail & Related papers (2024-05-24T23:09:27Z) - Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose the Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)