Learning Affordances at Inference-Time for Vision-Language-Action Models
- URL: http://arxiv.org/abs/2510.19752v1
- Date: Wed, 22 Oct 2025 16:43:29 GMT
- Title: Learning Affordances at Inference-Time for Vision-Language-Action Models
- Authors: Ameesh Shah, William Chen, Adwait Godbole, Federico Mora, Sanjit A. Seshia, Sergey Levine
- Abstract summary: In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks. We introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution.
- Score: 50.93181349331096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Solving complex real-world control tasks often takes multiple tries: if we fail at first, we reflect on what went wrong, and change our strategy accordingly to avoid making the same mistake. In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks, but lack the ability to contextually and dynamically readjust behavior when they fail to accomplish a task. In this work, we introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences by including them in-context, allowing it to learn the affordances and capabilities of the low-level VLA. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution and draws useful conclusions to be included in future reasoning contexts. Unlike similar approaches to self-refinement in non-robotics domains, LITEN must reflect on unstructured real-world robot trajectories (e.g., raw videos), which requires structured guiderails during assessment. Our experimental results demonstrate LITEN is able to effectively learn from past experience to generate plans that use high-affordance instructions to accomplish long-horizon tasks.
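The loop described in the abstract can be made concrete with a minimal sketch. The callables plan_fn, execute_fn, and assess_fn below are hypothetical stand-ins for the high-level VLM planner, the low-level VLA policy, and the reflection step; the names and signatures are illustrative assumptions, not LITEN's actual API.

```python
# Minimal sketch of the reasoning/assessment iteration described in the
# abstract. plan_fn, execute_fn, and assess_fn are hypothetical stand-ins
# for the high-level VLM planner, the low-level VLA policy, and the
# reflection step; they are not LITEN's actual interfaces.

def liten_loop(task, plan_fn, execute_fn, assess_fn, max_attempts=5):
    """Alternate between a reasoning phase (plan + execute) and an
    assessment phase, accumulating conclusions in-context."""
    experience = []  # conclusions carried into future reasoning contexts
    for _ in range(max_attempts):
        # Reasoning phase: condition the high-level VLM on past experience
        # to produce a plan of instructions for the low-level VLA.
        plan = plan_fn(task, experience)

        # Execute each instruction with the low-level policy and record the
        # resulting unstructured trajectory (e.g., raw video frames).
        trajectory = [execute_fn(instruction) for instruction in plan]

        # Assessment phase: reflect on the execution with structured
        # guiderails and distill conclusions about the VLA's affordances.
        succeeded, conclusions = assess_fn(task, plan, trajectory)
        experience.append(conclusions)

        if succeeded:
            break
    return experience
```

The key property, per the abstract, is that no weights are updated: only the in-context experience passed back to the planner grows between attempts, which is how the high-level VLM learns the low-level VLA's affordances and capabilities.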
Related papers
- Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models [7.802379200026965]
We propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators.
arXiv Detail & Related papers (2026-03-05T13:14:41Z) - Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures [14.313346858887286]
Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises because exploration is constrained by the preceding Supervised Fine-Tuning (SFT) stage. We propose Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback.
arXiv Detail & Related papers (2026-03-01T11:41:22Z) - EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models [57.75717492488268]
Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models. Supervised Fine-Tuning (SFT) requires hundreds of demonstrations per task, rigidly memorizes trajectories, and fails to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations.
arXiv Detail & Related papers (2025-12-16T18:26:38Z) - Hierarchical Vision Language Action Model Using Success and Failure Demonstrations [60.82332413442677]
We introduce VINE, a hierarchical vision-language-action model that separates high-level reasoning from low-level control. System 2 performs feasibility-guided tree search over a 2D scene-graph abstraction. System 1 executes low-level actions without modifying the agent's core skills.
arXiv Detail & Related papers (2025-12-03T15:58:38Z) - Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs [72.08224879435762]
Learn-to-Ask is a simulator-free framework for learning and deploying proactive dialogue agents. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service.
arXiv Detail & Related papers (2025-10-29T12:08:07Z) - Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning [124.48672228625821]
We introduce Vlaser, a Vision-Language-Action Model with synergistic embodied reasoning capability. Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks. Our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
arXiv Detail & Related papers (2025-10-13T05:51:22Z) - IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction [51.130510883952546]
Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control. We propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning.
arXiv Detail & Related papers (2025-10-09T04:49:46Z) - EvoCurr: Self-evolving Curriculum with Behavior Code Generation for Complex Decision-making [9.228654390917123]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, including programming, planning, and decision-making. We propose a novel self-evolving framework, EvoCurr, in which a dedicated curriculum-generation LLM constructs a sequence of problem instances with gradually increasing difficulty. We show that our method significantly improves task success rates and solution efficiency compared to direct-solving baselines.
arXiv Detail & Related papers (2025-08-13T07:59:29Z) - Weakly-supervised VLM-guided Partial Contrastive Learning for Visual Language Navigation [36.17444261325021]
Visual Language Navigation (VLN) is a fundamental task within the field of Embodied AI, focusing on the ability of agents to navigate complex environments based on natural language instructions. Existing methods rely on pre-trained backbone models for visual perception, which struggle with the dynamic viewpoints in VLN scenarios. We propose Weakly-supervised Partial Contrastive Learning (WPCL), a method that enhances an agent's ability to identify objects from dynamic viewpoints in VLN scenarios without requiring VLM fine-tuning.
arXiv Detail & Related papers (2025-06-18T11:43:50Z) - From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models [5.660635614478238]
Vision-Language-Action (VLA) models promise to produce versatile, "generalist" robot policies. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. We introduce a unified suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects.
arXiv Detail & Related papers (2025-06-11T16:52:18Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks (a toy sketch of this per-frame progress estimation appears after this list).
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - Validity Learning on Failures: Mitigating the Distribution Shift in Autonomous Vehicle Planning [2.3558144417896583]
The planning problem constitutes a fundamental aspect of the autonomous driving framework.
We propose Validity Learning on Failures, VL(on failure), as a remedy to address this issue.
We show that VL(on failure) outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2024-06-03T17:25:18Z) - From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM)-empowered agents can solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z)
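For the Generative Value Learning entry above, here is a rough illustration of in-context task-progress estimation with a VLM. The callable query_vlm, the prompt wording, and the percentage parsing are assumptions made for illustration only; they are not GVL's actual interface.

```python
# Hedged illustration of in-context task-progress estimation with a VLM.
# query_vlm is a hypothetical callable taking a text prompt plus images and
# returning a text reply; it is not part of any released GVL code.

def predict_progress(query_vlm, task_description, frames, few_shot_images=()):
    """Return a rough task-progress value in [0, 1] for each frame."""
    values = []
    for frame in frames:
        prompt = (
            f"Task: {task_description}\n"
            "Looking at the final image, estimate what percentage of the "
            "task has been completed. Answer with a number from 0 to 100."
        )
        # Optional few-shot images provide in-context calibration examples.
        reply = query_vlm(prompt, images=list(few_shot_images) + [frame])
        values.append(float(reply.strip().rstrip("%")) / 100.0)
    return values
```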
This list is automatically generated from the titles and abstracts of the papers in this site.