Vision-Language Models Unlock Task-Centric Latent Actions
- URL: http://arxiv.org/abs/2601.22714v1
- Date: Fri, 30 Jan 2026 08:38:59 GMT
- Title: Vision-Language Models Unlock Task-Centric Latent Actions
- Authors: Alexander Nikulin, Ilya Zisman, Albina Klepach, Denis Tarasov, Alexander Derevyagin, Andrei Polubarov, Nikita Lyubaykin, Vladislav Kurenkov
- Abstract summary: We propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations. We show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.
- Score: 75.53481518882275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from the noise in an unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs, revealing substantial variation in the quality of promptable representations as well as their robustness to different prompts and hyperparameters. Interestingly, we find that more recent VLMs may perform worse than older ones. Finally, we show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.
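The recipe described in the abstract, regressing a latent action model onto promptable VLM embeddings rather than raw pixels, can be sketched briefly. The snippet below is a minimal illustration, not the authors' implementation: the IDM/FDM split follows the standard latent-action-model setup, and `vlm_target_tp1` stands in for a VLM embedding of the next frame obtained with a prompt such as "describe the robot arm; ignore the background". All module names, dimensions, and helpers are hypothetical.

```python
# Minimal sketch of the abstract's recipe (not the authors' code): train a
# Latent Action Model (LAM) whose forward-model target is a promptable VLM
# embedding of the next frame rather than raw pixels. All names, dimensions,
# and `vlm_target_tp1` are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionModel(nn.Module):
    """Standard IDM/FDM pair: infer a latent action z_t from (o_t, o_{t+1}),
    then predict a representation of o_{t+1} from (o_t, z_t)."""

    def __init__(self, obs_dim: int, target_dim: int, latent_dim: int = 16):
        super().__init__()
        # Inverse dynamics model: (o_t, o_{t+1}) -> z_t
        self.idm = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        # Forward dynamics model: (o_t, z_t) -> predicted target for o_{t+1}
        self.fdm = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, target_dim)
        )

    def forward(self, obs_t: torch.Tensor, obs_tp1: torch.Tensor):
        z = self.idm(torch.cat([obs_t, obs_tp1], dim=-1))
        pred = self.fdm(torch.cat([obs_t, z], dim=-1))
        return pred, z


def lam_loss(model, obs_t, obs_tp1, vlm_target_tp1):
    """Regress onto the VLM's promptable embedding of the next frame.

    `vlm_target_tp1` would be produced offline by prompting a VLM to ignore
    distractors, so distractor motion is absent from the target.
    """
    pred, _ = model(obs_t, obs_tp1)
    return F.mse_loss(pred, vlm_target_tp1)


# Toy usage with random tensors standing in for frame features and VLM targets.
model = LatentActionModel(obs_dim=128, target_dim=64)
o_t, o_tp1 = torch.randn(8, 128), torch.randn(8, 128)
target = torch.randn(8, 64)  # placeholder for prompted VLM embeddings
loss = lam_loss(model, o_t, o_tp1, target)
loss.backward()
```

Because the loss lives in the VLM's representation space, anything the prompt tells the VLM to ignore drops out of the target and carries no gradient, which is the mechanism the abstract credits for the improved latent action quality on Distracting MetaWorld.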
Related papers
- VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents [96.01507640637534]
We introduce VisGym, a gymnasium of 17 environments for evaluating and training modern Vision-Language Models (VLMs). The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations.
arXiv Detail & Related papers (2026-01-23T18:43:34Z)
- DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models [25.91822750707556]
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation. VLA models may overly attend to image tokens in task-irrelevant regions, which we describe as 'distracting tokens'. This behavior can disturb the model's generation of the desired action tokens at each step, affecting the task success rate.
arXiv Detail & Related papers (2026-01-22T16:02:56Z)
- VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models [43.09726338623949]
Vision-Language-Action (VLA) models integrate pretrained large Vision-Language Models (VLMs) into their policy backbone. This paper revisits a fundamental yet seldom systematically studied question: how VLM choice and competence translate to downstream VLA policy performance. We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters.
arXiv Detail & Related papers (2026-01-06T09:58:24Z)
- WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens [69.97021957331326]
We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and the Diffusion Model via end-to-end optimization. We also introduce a VAE branch with linear projection to recover fine-grained image details.
arXiv Detail & Related papers (2025-12-02T09:02:20Z)
- Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization [42.41263928527529]
Vision-Language-Action (VLA) models build on VLMs that endow agents with transferable world knowledge and vision-language grounding. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original visual representations and knowledge are preserved. We conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations.
arXiv Detail & Related papers (2025-10-29T15:20:10Z)
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
- VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought [41.72701516732208]
Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot learning but require high-quality demonstrations. We propose In-Context Abstraction Learning (ICAL), enabling VLM agents to transform suboptimal trajectories into high-quality training data.
arXiv Detail & Related papers (2024-06-20T17:45:02Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy, which aims to rectify noisy predictions from vision models.
By fine-tuning with the denoised labels, model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z)