VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs
- URL: http://arxiv.org/abs/2406.14596v3
- Date: Thu, 31 Oct 2024 05:38:39 GMT
- Title: VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs
- Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
- Abstract summary: Large-scale generative language and vision-language models excel in in-context learning for decision making.
We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience from sub-optimal demonstrations and human feedback.
Our approach significantly reduces reliance on manual prompt engineering and consistently outperforms in-context learning from action plans that lack such abstractions.
- Score: 38.03704123835915
- Abstract: Large-scale generative language and vision-language models excel in in-context learning for decision making. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience from sub-optimal demonstrations and human feedback. Given a task demonstration that may contain inefficiencies or mistakes, a VLM abstracts the trajectory into a generalized program by correcting inefficient actions and annotating cognitive abstractions: causal relationships, object state changes, temporal subgoals, and task-relevant visual elements. These abstractions are iteratively improved through human feedback while the agent attempts to execute the trajectory. The resulting examples, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Moreover, as the agent's library of examples grows, it becomes more efficient, relying less on human feedback and requiring fewer environment interactions per demonstration. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7% using GPT4V. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on manual prompt engineering and consistently outperforms in-context learning from action plans that lack such abstractions.
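To make the pipeline in the abstract concrete, here is a minimal sketch of the distill-and-retrieve loop it describes: a VLM abstracts a sub-optimal demonstration into a corrected program with annotated abstractions, refines it against human feedback during execution, and stores it for retrieval-augmented prompting. The interfaces (`vlm.abstract`, `vlm.revise`, `env.execute`, `vlm.generate`), the `Example`/`Memory` data structures, and the retrieval logic are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the ICAL loop described in the abstract; all interfaces
# and data structures below are hypothetical stand-ins.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Example:
    task: str
    program: str        # corrected action sequence distilled from the demo
    abstractions: Dict  # causal links, object state changes, subgoals, visual cues


@dataclass
class Memory:
    examples: List[Example] = field(default_factory=list)

    def add(self, example: Example) -> None:
        self.examples.append(example)

    def retrieve(self, task: str, k: int = 3) -> List[Example]:
        # Stand-in for retrieval-augmented exemplar selection
        # (e.g., embedding similarity over tasks and visual context).
        return self.examples[:k]


def distill(vlm, env, demo, task: str, memory: Memory, max_rounds: int = 3) -> Example:
    """Turn a sub-optimal demonstration into a generalized in-context example,
    refining it with human feedback while the agent executes the trajectory."""
    example = vlm.abstract(demo, task)          # correct actions, annotate abstractions
    for _ in range(max_rounds):
        outcome = env.execute(example.program)  # agent attempts the trajectory
        if outcome.success:
            break
        example = vlm.revise(example, outcome.feedback)  # incorporate human feedback
    memory.add(example)
    return example


def act(vlm, memory: Memory, task: str, observation) -> str:
    """Decision making with retrieved examples used as prompt exemplars."""
    exemplars = memory.retrieve(task)
    context = "\n\n".join(
        f"Task: {e.task}\nAbstractions: {e.abstractions}\nProgram: {e.program}"
        for e in exemplars
    )
    return vlm.generate(f"{context}\n\nTask: {task}\nObservation: {observation}")
```

As the memory grows, `distill` should need fewer feedback rounds per new demonstration, which is the efficiency effect the abstract reports.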
Related papers
- Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning [6.06616040517684]
DAAG hindsight relabels the agent's past experience by using diffusion models to transform videos.
A large language model orchestrates this autonomous process without requiring human supervision.
Results show that DAAG improves learning of reward detectors, transferring past experience, and acquiring new tasks.
arXiv Detail & Related papers (2024-07-30T13:01:31Z)
- Large Vision-Language Models as Emotion Recognizers in Context Awareness [14.85890824622433]
Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues.
Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images.
This paper systematically explores the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task.
arXiv Detail & Related papers (2024-07-16T01:28:06Z)
- Show, Don't Tell: Aligning Language Models with Demonstrated Feedback [54.10302745921713]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors.
We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z)
- DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning [75.68193159293425]
In-context learning (ICL) allows transformer-based language models to learn a specific task with a few "task demonstrations" without updating their parameters.
We propose an influence function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL.
We experimentally demonstrate the wide applicability of DETAIL by showing that attribution scores obtained on white-box models transfer to black-box models and improve model performance.
arXiv Detail & Related papers (2024-05-22T15:52:52Z)
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL).
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
- MetaReflection: Learning Instructions for Language Agents using Past Reflections [11.028256182234017]
We introduce MetaReflection, a novel offline reinforcement learning technique that enhances the performance of Language Agents.
We demonstrate the efficacy of MetaReflection by evaluating across multiple domains, including complex logical reasoning, biomedical semantic similarity, open world question answering, and vulnerability threat detection.
arXiv Detail & Related papers (2024-05-13T10:51:43Z)
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach for generating visual prompts that, after being trained on only one model, can transfer to different models and improve their performance on downstream tasks.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [92.85265959892115]
This paper introduces the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers.
To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach that evaluates visual instruction tuning in a manner similar to human experts.
arXiv Detail & Related papers (2023-06-26T10:26:33Z)
- OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
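As a rough illustration of the batched zero-shot prompting idea in the OverPrompt entry above, the sketch below packs several classification inputs into a single request; the prompt wording, the `complete` client callable, and the output parsing are assumptions, not the paper's implementation.

```python
# Hedged sketch of batching multiple zero-shot classification inputs into one
# prompt, in the spirit of OverPrompt; the prompt format and `complete` client
# are illustrative assumptions.
from typing import Callable, List


def batched_zero_shot_classify(complete: Callable[[str], str],
                               texts: List[str],
                               labels: List[str]) -> List[str]:
    numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))
    prompt = (
        f"Classify each numbered input as one of: {', '.join(labels)}.\n"
        f"Answer with one label per line, in order.\n\n{numbered}"
    )
    reply = complete(prompt)  # single API call covers the whole batch
    preds = [line.strip() for line in reply.splitlines() if line.strip()]
    return preds[:len(texts)]
```

A single call amortizes per-request overhead and shared instructions across the batch, which is where the cost saving in zero-shot classification would come from.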