Stable Language Guidance for Vision-Language-Action Models
- URL: http://arxiv.org/abs/2601.04052v1
- Date: Wed, 07 Jan 2026 16:16:10 GMT
- Title: Stable Language Guidance for Vision-Language-Action Models
- Authors: Zhihao Zhan, Yuhao Chen, Jiaying Zhou, Qinhan Lv, Hao Liu, Keze Wang, Liang Lin, Guangrun Wang
- Abstract summary: Residual Semantic Steering is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
- Score: 62.80963701282789
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalized robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical ``modality collapse'' phenomenon where strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose \textbf{Residual Semantic Steering (RSS)}, a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) \textbf{Monte Carlo Syntactic Integration}, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) \textbf{Residual Affordance Steering}, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the visual affordance prior. Theoretical analysis suggests that RSS effectively maximizes the mutual information between action and intent while suppressing visual distractors. Empirical results across diverse manipulation benchmarks demonstrate that RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
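For intuition, the two components admit a compact decoding-time sketch. The Python below is a minimal, hypothetical illustration of the ideas as stated in the abstract, not the authors' implementation: `policy_logits_fn`, `paraphrase_fn`, the number of paraphrase samples, and the guidance scale `alpha` are all assumed names and parameters.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def monte_carlo_syntactic_integration(policy_logits_fn, image, instruction,
                                      paraphrase_fn, n_samples=8):
    """Approximate the semantic posterior p(action | image, intent) by averaging
    the policy's action distribution over LLM-generated paraphrases of the
    instruction, i.e. a Monte Carlo estimate over surface forms of one intent."""
    variants = [instruction] + paraphrase_fn(instruction, n_samples - 1)
    per_variant = [softmax(policy_logits_fn(image, v)) for v in variants]
    return np.mean(per_variant, axis=0)

def residual_affordance_steering(policy_logits_fn, image, instruction, alpha=1.0):
    """Dual-stream decoding: run a language-conditioned stream and a vision-only
    stream, then subtract the visual affordance prior so that the language
    residual steers the action. The scaled additive form below is one plausible
    instantiation (an assumption), not necessarily the paper's exact rule."""
    conditioned = policy_logits_fn(image, instruction)  # vision + language stream
    affordance_prior = policy_logits_fn(image, "")      # vision-only affordance stream
    steered = conditioned + alpha * (conditioned - affordance_prior)
    return softmax(steered)
```

Under this reading, the two pieces compose naturally: the paraphrase averaging can be applied to the steered logits rather than the raw ones, so the final action distribution reflects the intent shared across phrasings while the vision-only prior is cancelled out.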
Related papers
- CrystaL: Spontaneous Emergence of Visual Latents in MLLMs [55.34169914483764]
We propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics. Experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-24T15:01:30Z) - Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization [20.179748846776995]
We propose ActionVLM, a vision-language aggregation framework that mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. Experiments on THUMOS14 show that our model outperforms the state of the art by up to 3.2% mAP.
arXiv Detail & Related papers (2026-01-28T22:03:46Z) - \textsc{NaVIDA}: Vision-Language Navigation with Inverse Dynamics Augmentation [50.027425808733994]
\textsc{NaVIDA} is a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. \textsc{NaVIDA} augments training with chunk-based inverse-dynamics supervision to learn the causal relationship between visual changes and corresponding actions. Experiments show that \textsc{NaVIDA} achieves superior navigation performance compared to state-of-the-art methods with fewer parameters.
arXiv Detail & Related papers (2026-01-26T06:16:17Z) - Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs [9.043999205886658]
Hallucinations in large vision-language models often arise when language priors dominate over visual evidence. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths. ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost.
arXiv Detail & Related papers (2026-01-20T08:04:18Z) - Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy [75.66913260900726]
Reinforcement Learning with Verifiable Rewards has significantly advanced reasoning capabilities in Large Language Models. Existing paradigms, driven by text-centric outcome rewards, encourage models to bypass visual perception. We propose \textbf{Thinking with Deltas}, a framework driven by a \textbf{Differential Visual Reasoning Policy}.
arXiv Detail & Related papers (2026-01-11T08:25:34Z) - Behavior Tokens Speak Louder: Disentangled Explainable Recommendation with Behavior Vocabulary [22.925582428795437]
BEAT is a framework that tokenizes user and item behaviors into discrete, interpretable sequences. We show that BEAT improves zero-shot recommendation performance while generating coherent and informative explanations.
arXiv Detail & Related papers (2025-12-17T17:24:24Z) - When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models [75.16145284285456]
We introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. We develop the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark reveal that even minor multimodal perturbations can cause significant behavioral deviations.
arXiv Detail & Related papers (2025-11-20T10:14:32Z) - LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z) - Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools [41.993750134878766]
Video-STAR is a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition. Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching. Our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, shifting from text-centric reasoning to visually grounded inference.
arXiv Detail & Related papers (2025-10-09T17:20:44Z) - Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z) - PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance [39.38656685766509]
Existing weakly supervised video violence detection methods rely on Euclidean representation learning. We propose PiercingEye, a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries. Experiments on XD-Violence and UCF-Crime benchmarks demonstrate that PiercingEye achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-04-26T09:29:10Z) - Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension [31.952192907460713]
Relation-R1 is the \textit{first} unified relation comprehension framework. It integrates cognitive chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and group relative policy optimization (GRPO). Experiments on widely used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and \textit{N}-ary relation understanding.
arXiv Detail & Related papers (2025-04-20T14:50:49Z) - Prompt-based Logical Semantics Enhancement for Implicit Discourse Relation Recognition [4.7938839332508945]
We propose a Prompt-based Logical Semantics Enhancement (PLSE) method for Implicit Discourse Relation Recognition (IDRR).
Our method seamlessly injects knowledge relevant to discourse relation into pre-trained language models through prompt-based connective prediction.
Experimental results on PDTB 2.0 and CoNLL16 datasets demonstrate that our method achieves outstanding and consistent performance against the current state-of-the-art models.
arXiv Detail & Related papers (2023-11-01T08:38:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.