Related papers: VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making

VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making

URL: http://arxiv.org/abs/2505.03181v1
Date: Tue, 06 May 2025 04:51:57 GMT
Title: VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making
Authors: Jake Grigsby, Yuke Zhu, Michael Ryoo, Juan Carlos Niebles,
Abstract summary: Vision-language models (VLMs) extend large language models (LLMs) to multi-modal data.<n>Our work approaches these challenges from an offline-to-online reinforcement learning (RL) perspective.
Score: 45.02997774119763
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent research looks to harness the general knowledge and reasoning of large language models (LLMs) into agents that accomplish user-specified goals in interactive environments. Vision-language models (VLMs) extend LLMs to multi-modal data and provide agents with the visual reasoning necessary for new applications in areas such as computer automation. However, agent tasks emphasize skills where accessible open-weight VLMs lag behind their LLM equivalents. For example, VLMs are less capable of following an environment's strict output syntax requirements and are more focused on open-ended question answering. Overcoming these limitations requires supervised fine-tuning (SFT) on task-specific expert demonstrations. Our work approaches these challenges from an offline-to-online reinforcement learning (RL) perspective. RL lets us fine-tune VLMs to agent tasks while learning from the unsuccessful decisions of our own model or more capable (larger) models. We explore an off-policy RL solution that retains the stability and simplicity of the widely used SFT workflow while allowing our agent to self-improve and learn from low-quality datasets. We demonstrate this technique with two open-weight VLMs across three multi-modal agent domains.

Related papers

Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration [35.429026246760635]
BeMyEyes is a modular framework for extending Large Language Models (LLMs) to multimodal reasoning.<n>By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models.<n> Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs.
arXiv Detail & Related papers (2025-11-24T18:55:16Z)
Empowering Multimodal LLMs with External Tools: A Comprehensive Survey [61.66069828956139]
Multimodal Large Language Models (MLLMs) have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence.<n>Lack of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols hinder the reliability and broader applicability of MLLMs.<n>Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools offers a promising strategy to overcome these challenges.
arXiv Detail & Related papers (2025-08-14T07:25:45Z)
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering [27.812611421754482]
We propose an MLLMs-based dual momentum Mixture-of-Experts (CL-MoE) framework for continual visual question answering (VQA)<n>We integrate MLLMs with continual learning to utilize the rich commonsense knowledge in LLMs.<n>Our method achieves state-of-the-art performance on 10 VQA tasks, proving the effectiveness of our approach.
arXiv Detail & Related papers (2025-03-01T09:25:23Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.<n>However, they still struggle with problems requiring multi-step decision-making and environmental feedback.<n>We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation [52.739500459903724]
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. We propose a novel multi-agent LLM framework that distributes high-level planning and low-level control code generation across specialized LLM agents. We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting.
arXiv Detail & Related papers (2024-11-26T17:53:44Z)
Rethinking VLMs and LLMs for Image Classification [6.550471260627169]
Large Language Models (LLMs) are increasingly being merged with Visual Language Models (VLMs) to enable new capabilities. We show that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do. We propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes visual tasks to the most suitable model for the task.
arXiv Detail & Related papers (2024-10-03T23:40:21Z)
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently. Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting. We propose a LLM-based Generative IoT (GIoT) system deployed in the local network setting in this study.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
Reinforcement Learning Problem Solving with Large Language Models [0.0]
Large Language Models (LLMs) have an extensive amount of world knowledge, and this has enabled their application in various domains to improve the performance of Natural Language Processing (NLP) tasks. This has also facilitated a more accessible paradigm of conversation-based interactions between humans and AI systems to solve intended problems. We show the practicality of our approach through two detailed case studies for "Research Scientist" and "Legal Matter Intake"
arXiv Detail & Related papers (2024-04-29T12:16:08Z)
Large Language Model based Multi-Agents: A Survey of Progress and Challenges [44.92286030322281]
Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks. Recently, based on the development of using one LLM as a single planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation.
arXiv Detail & Related papers (2024-01-21T23:36:14Z)
O3D: Offline Data-driven Discovery and Distillation for Sequential Decision-Making with Large Language Models [16.91329676173649]
Offline Data-driven Discovery and Distillation (O3D) is proposed to improve large language models (LLMs) O3D automatically discovers reusable skills and distills generalizable knowledge across multiple tasks based on offline interaction data. Empirical results under two interactive decision-making benchmarks (ALFWorld and WebShop) verify that O3D can notably enhance the decision-making capabilities of LLMs.
arXiv Detail & Related papers (2023-10-22T20:28:33Z)
AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach [31.6589518077397]
Large language models (LLMs) encode a vast amount of world knowledge acquired from massive text datasets. LLMs can assist an embodied agent in solving complex sequential decision making tasks by providing high-level instructions. We propose When2Ask, a reinforcement learning based approach that learns when it is necessary to query LLMs for high-level instructions.
arXiv Detail & Related papers (2023-06-06T11:49:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.