Related papers: LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

URL: http://arxiv.org/abs/2601.15197v4
Date: Tue, 27 Jan 2026 14:51:48 GMT
Title: LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen,
Abstract summary: LangForce is a novel framework that enforces instruction following via Bayesian decomposition. We show that LangForce significantly improves generalization without requiring new data.
Score: 30.732526921367835
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $π(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
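
The conditional PMI described in the abstract can be illustrated with a small discrete toy example. This is only a sketch under stated assumptions, not the LangForce implementation: the instructions, actions, and probabilities are hypothetical, and the vision-only prior is obtained here by analytically marginalizing over instructions, whereas the abstract describes estimating it with a learned branch built on Latent Action Queries. The sketch computes $\log π(a \mid v, \ell) - \log p(a \mid v)$ for one fixed observation and shows that the quantity drops to zero when the instruction is fully predictable from the observation.

```python
import math


def vision_only_prior(posteriors, instr_given_vision):
    """Marginalize instructions out: p(a | v) = sum_l p(l | v) * pi(a | v, l)."""
    actions = next(iter(posteriors.values())).keys()
    return {
        a: sum(instr_given_vision[l] * posteriors[l][a] for l in posteriors)
        for a in actions
    }


def conditional_pmi(posterior, prior, action):
    """Conditional PMI for one action: log pi(a | v, l) - log p(a | v)."""
    return math.log(posterior[action]) - math.log(prior[action])


# Hypothetical language-conditioned posteriors pi(a | v, l) for one fixed observation v.
posteriors = {
    "pick red": {"reach_left": 0.9, "reach_right": 0.1},
    "pick blue": {"reach_left": 0.1, "reach_right": 0.9},
}

# Case 1: both instructions are equally likely given the observation.
balanced = vision_only_prior(posteriors, {"pick red": 0.5, "pick blue": 0.5})
pmi_balanced = conditional_pmi(posteriors["pick red"], balanced, "reach_left")

# Case 2: the observation alone determines the instruction (p(l | v) is one-hot),
# so the prior already equals the posterior and the instruction adds nothing.
collapsed = vision_only_prior(posteriors, {"pick red": 1.0, "pick blue": 0.0})
pmi_collapsed = conditional_pmi(posteriors["pick red"], collapsed, "reach_left")

print(f"balanced PMI:  {pmi_balanced:.4f}")
print(f"collapsed PMI: {pmi_collapsed:.4f}")
```

In the balanced case the PMI is positive because the instruction changes which action is likely. In the collapsed case the prior and posterior coincide, which is a toy version of the vision shortcut that the abstract says the objective penalizes.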

Related papers

Point What You Mean: Visually Grounded Instruction Policy [42.52502990975079]
Point-VLA is a plug-and-play policy that augments language instructions with explicit visual cues to resolve referential ambiguity. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs.
arXiv Detail & Related papers (2025-12-22T00:44:19Z)
Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy [59.44168425139687]
BayesVLA is a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. Experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods.
arXiv Detail & Related papers (2025-12-12T01:59:23Z)
Do What You Say: Steering Vision-Language-Action Models via Runtime Reasoning-Action Alignment Verification [17.948161564138033]
Reasoning Vision Language Action (VLA) models improve robotic instruction-following by generating step-by-step textual plans before low-level actions. But even with a correct textual plan, the generated actions can still miss the intended outcomes in the plan, especially in out-of-distribution scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness, and introduce a training-free, runtime policy steering method for reasoning-action alignment.
arXiv Detail & Related papers (2025-10-18T00:38:45Z)
FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks [45.65159253753118]
This work explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations. We provide language feedback embeddings as part of the input sequence into a Transformer-based policy. We test our approach on a range of embodied Vision-and-Language tasks in our custom BabyAI-XGen environment.
arXiv Detail & Related papers (2025-10-13T11:55:21Z)
Do What? Teaching Vision-Language-Action Models to Reject the Impossible [53.40183895299108]
Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. We propose Instruct-Verify-and-Act (IVA), a framework that detects when an instruction cannot be executed due to a false premise. Our experiments show that IVA improves false premise detection accuracy by 97.56% over baselines.
arXiv Detail & Related papers (2025-08-22T10:54:33Z)
HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction [55.00788339683146]
We propose a novel Hierarchical vision-Language collaboration framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels. This approach enables the comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts.
arXiv Detail & Related papers (2025-07-07T02:06:25Z)
From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models [5.660635614478238]
Vision-Language-Action (VLA) models promise to produce versatile, "generalist" robot policies. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. We introduce a unified suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects.
arXiv Detail & Related papers (2025-06-11T16:52:18Z)
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [63.54377402784965]
We propose a Rewriting-driven AugMentation (RAM) paradigm for Vision-Language Navigation (VLN). Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners. Experiments on both the discrete environments (R2R, REVERIE, and R4R dataset) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method.
arXiv Detail & Related papers (2025-03-23T13:18:17Z)
Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter [59.69563889773648]
We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. We propose an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer.
arXiv Detail & Related papers (2025-03-12T14:20:33Z)
LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines. We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z)
Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences. In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
Skill Induction and Planning with Latent Language [94.55783888325165]
We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions. We describe how to train this model using primarily unannotated demonstrations by parsing demonstrations into sequences of named high-level subtasks. In trained models, the space of natural language commands indexes a library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals.
arXiv Detail & Related papers (2021-10-04T15:36:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.