Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy
- URL: http://arxiv.org/abs/2512.11218v1
- Date: Fri, 12 Dec 2025 01:59:23 GMT
- Title: Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy
- Authors: Kechun Xu, Zhenjie Zhu, Anzhe Chen, Shuqi Zhao, Qing Huang, Yifei Yang, Haojian Lu, Rong Xiong, Masayoshi Tomizuka, Yue Wang
- Abstract summary: BayesVLA is a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. Experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods.
- Score: 59.44168425139687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it requires careful tuning and incurs data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. This inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. Information-theoretic analysis formally validates our effectiveness in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods. Project page is available at: https://xukechun.github.io/papers/BayesVLA.
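The factorization described in the abstract can be illustrated with a toy numerical sketch: the policy over actions is obtained by reweighting a visual-action prior ("seeing to act") with a language-conditioned likelihood ("prompting to specify") and renormalizing. The action space, the probability values, and the function name below are all invented for illustration; this is not the paper's implementation.

```python
import numpy as np

def bayes_policy(prior, likelihood):
    """Posterior policy pi(a | v, l) proportional to p(a | v) * p(l | a, v)."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Visual-action prior p(a | v) over 3 hypothetical candidate actions
# (e.g., grasp targets visible in the image).
prior = np.array([0.5, 0.3, 0.2])

# Language-conditioned likelihood p(l | a, v): how well each candidate
# action matches a hypothetical instruction such as "pick up the red block".
likelihood = np.array([0.1, 0.8, 0.1])

policy = bayes_policy(prior, likelihood)
print(policy)  # the instruction-consistent action now carries most of the mass
```

Under this reading, the prior alone already yields executable behavior when no instruction is given, while the likelihood term only reweights among visually plausible actions, which is one way the factorization could discourage visual shortcut learning.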
Related papers
- Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization [20.179748846776995]
We propose ActionVLM, a vision-language aggregation framework that mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. Experiments on THUMOS14 show that our model outperforms state-of-the-art by up to 3.2% mAP.
arXiv Detail & Related papers (2026-01-28T22:03:46Z) - LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries [30.732526921367835]
LangForce is a novel framework that enforces instruction following via Bayesian decomposition. We show that LangForce significantly improves generalization without requiring new data.
arXiv Detail & Related papers (2026-01-21T17:15:22Z) - Do What You Say: Steering Vision-Language-Action Models via Runtime Reasoning-Action Alignment Verification [17.948161564138033]
Reasoning Vision Language Action (VLA) models improve robotic instruction-following by generating step-by-step textual plans before low-level actions. But even with a correct textual plan, the generated actions can still miss the intended outcomes in the plan, especially in out-of-distribution scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness, and introduce a training-free, runtime policy steering method for reasoning-action alignment.
arXiv Detail & Related papers (2025-10-18T00:38:45Z) - From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models [5.660635614478238]
Vision-Language-Action (VLA) models promise to produce versatile, "generalist" robot policies. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. We introduce a unified suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects.
arXiv Detail & Related papers (2025-06-11T16:52:18Z) - OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [95.6266030753644]
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies. We propose OTTER, a novel VLA architecture that leverages existing alignments through explicit, text-aware visual feature extraction.
arXiv Detail & Related papers (2025-03-05T18:44:48Z) - FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability [10.184567639685321]
We introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding. We present benchmarks to assess the model's ability to use images as substantive evidence. We identify attention heads with the strongest vision-language alignment, enabling explainability of visually-driven hallucinations.
arXiv Detail & Related papers (2024-12-19T09:24:10Z) - Improving Visual Commonsense in Language Models via Multiple Image Generation [41.565399860320966]
Existing large language models (LLMs) are primarily trained using textual data only.
Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning.
This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning.
arXiv Detail & Related papers (2024-06-19T15:17:10Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are memorized with the weights of FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - Vision-and-Language Navigation via Causal Learning [13.221880074458227]
Cross-modal causal transformer (GOAT) is a pioneering solution rooted in the paradigm of causal inference.
BACL and FACL modules promote unbiased learning by comprehensively mitigating potential spurious correlations.
To capture global confounder features, we propose a cross-modal feature pooling module supervised by contrastive learning.
arXiv Detail & Related papers (2024-04-16T02:40:35Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS) that learns to generalize to unseen environments via encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z) - Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)