Digi-Q: Learning Q-Value Functions for Training Device-Control Agents
- URL: http://arxiv.org/abs/2502.15760v1
- Date: Thu, 13 Feb 2025 18:55:14 GMT
- Title: Digi-Q: Learning Q-Value Functions for Training Device-Control Agents
- Authors: Hao Bai, Yifei Zhou, Li Erran Li, Sergey Levine, Aviral Kumar
- Abstract summary: Digi-Q trains VLM-based action-value Q-functions which are then used to extract the agent policy. Digi-Q outperforms several prior methods on user-scale device control tasks in Android-in-the-Wild.
- Score: 73.60512136881279
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While a number of existing approaches for building foundation model agents rely on prompting or fine-tuning with human demonstrations, these approaches are not sufficient in dynamic environments (e.g., mobile device control). On-policy reinforcement learning (RL) should address these limitations, but collecting actual rollouts in an environment is often undesirable in truly open-ended agentic problems such as mobile device control or interacting with humans, where each unit of interaction carries a cost. In such scenarios, a method for policy learning that can utilize off-policy experience by learning an action-value function is much more effective. In this paper, we develop an approach, called Digi-Q, to train VLM-based action-value Q-functions which are then used to extract the agent policy. We study our approach in the mobile device control setting. Digi-Q trains the Q-function using offline temporal-difference (TD) learning, on top of frozen, intermediate-layer features of a VLM. Compared to fine-tuning the whole VLM, this approach saves compute and enhances scalability. To make the VLM features amenable to representing the Q-function, we employ an initial fine-tuning phase that amplifies coverage of the actionable information needed by the value function. Once trained, we use this Q-function via a Best-of-N policy extraction operator that imitates the best action out of multiple candidate actions from the current policy, as ranked by the value function, enabling policy improvement without environment interaction. Digi-Q outperforms several prior methods on user-scale device control tasks in Android-in-the-Wild, attaining a 21.2% improvement over the prior best-performing method. In some cases, our Digi-Q approach already matches state-of-the-art RL methods that require interaction. The project is open-sourced at https://github.com/DigiRL-agent/digiq
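A minimal sketch of the Best-of-N policy extraction idea from the abstract: a small Q-head over frozen features scores N candidate actions sampled around the current policy, and the policy imitates the highest-scoring candidate. The feature and action dimensions, candidate sampler, and network shapes below are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of Best-of-N policy extraction over frozen VLM features.
# Dimensions, the candidate sampler, and the imitation loss are assumptions.
import torch
import torch.nn as nn

FEAT_DIM, ACT_DIM, N_CANDIDATES = 512, 16, 8

# Small Q-head trained on top of frozen, intermediate-layer VLM features.
q_head = nn.Sequential(
    nn.Linear(FEAT_DIM + ACT_DIM, 256), nn.ReLU(), nn.Linear(256, 1)
)
# Policy maps the same frozen features to an action (here a plain MLP).
policy = nn.Sequential(
    nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM)
)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def best_of_n_update(frozen_feats):
    """Imitate the highest-Q candidate action; no environment interaction."""
    with torch.no_grad():
        mean = policy(frozen_feats)                           # (B, A)
        # Sample N noisy candidates around the current policy output.
        cands = mean.unsqueeze(1) + 0.1 * torch.randn(
            mean.size(0), N_CANDIDATES, ACT_DIM)              # (B, N, A)
        feats = frozen_feats.unsqueeze(1).expand(-1, N_CANDIDATES, -1)
        q = q_head(torch.cat([feats, cands], dim=-1)).squeeze(-1)  # (B, N)
        best = cands[torch.arange(len(cands)), q.argmax(dim=1)]    # (B, A)
    loss = ((policy(frozen_feats) - best) ** 2).mean()        # imitate the best
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(best_of_n_update(torch.randn(32, FEAT_DIM)))
```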
Related papers
- Bootstrapped Model Predictive Control [19.652808098339644]
We introduce Bootstrapped Model Predictive Control (BMPC), a novel algorithm that performs policy learning in a bootstrapped manner.
BMPC learns a network policy by imitating an MPC expert, and in turn, uses this policy to guide the MPC process.
Our method achieves superior performance over prior works on diverse continuous control tasks.
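A hedged sketch of the bootstrapped loop described above, assuming a toy known dynamics model: a random-shooting MPC expert is warm-started by the learned policy, and the policy is trained to imitate the expert's first action. Dynamics, horizon, and network sizes are illustrative, not the paper's setup.

```python
# Sketch: policy imitates an MPC expert; the policy warm-starts ("guides") MPC.
import torch
import torch.nn as nn

def dynamics(s, a):                    # toy known model: s' = s + a
    return s + a

def reward(s):                         # toy reward: stay near the origin
    return -(s ** 2).sum(-1)

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def mpc_expert(s, horizon=5, n_samples=64):
    """Random-shooting MPC centered on the policy's proposal."""
    with torch.no_grad():
        plans = policy(s) + 0.3 * torch.randn(n_samples, horizon, 4)
        ret, cur = torch.zeros(n_samples), s.expand(n_samples, 4)
        for t in range(horizon):
            cur = dynamics(cur, plans[:, t])
            ret = ret + reward(cur)
        return plans[ret.argmax(), 0]  # expert's first action

for _ in range(3):                     # bootstrapped imitation updates
    s = torch.randn(4)
    loss = ((policy(s) - mpc_expert(s)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```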
arXiv Detail & Related papers (2025-03-24T16:46:36Z)
- Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions [18.643104368680593]
In reinforcement learning, off-policy actor-critic approaches like DDPG and TD3 are based on the deterministic policy gradient.
We introduce a new actor architecture that combines two simple insights: (i) use multiple actors and evaluate the Q-value maximizing action, and (ii) learn surrogates to the Q-function that are simpler to optimize with gradient-based methods.
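A hedged sketch of insight (i): keep several actors and act with the proposal the critic scores highest. All sizes and networks are illustrative assumptions.

```python
# Sketch: ensemble of actors; act with the Q-value-maximizing proposal.
import torch
import torch.nn as nn

S_DIM, A_DIM, N_ACTORS = 8, 2, 4
actors = nn.ModuleList(
    nn.Sequential(nn.Linear(S_DIM, 64), nn.ReLU(),
                  nn.Linear(64, A_DIM), nn.Tanh())
    for _ in range(N_ACTORS)
)
critic = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

def act(state):
    """Evaluate each actor's proposal with the critic; return the best one."""
    with torch.no_grad():
        proposals = torch.stack([actor(state) for actor in actors])   # (K, A)
        q = critic(torch.cat([state.expand(N_ACTORS, -1), proposals], -1))
        return proposals[q.squeeze(-1).argmax()]

print(act(torch.randn(S_DIM)))
```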
arXiv Detail & Related papers (2024-10-15T17:58:03Z)
- Autonomous Vehicle Controllers From End-to-End Differentiable Simulation [60.05963742334746]
We propose a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers.
Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of environment dynamics serve as a useful prior to help the agent learn a more grounded policy.
We find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.
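A minimal sketch of analytic policy gradients through a differentiable simulator, on a toy point-mass rather than the paper's AV simulator: the rollout stays inside the autograd graph, so gradients of the return flow through the dynamics into the policy.

```python
# Sketch: backprop the return through differentiable dynamics (APG).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def step(state, action, dt=0.1):
    """Differentiable point-mass dynamics: state = (position, velocity)."""
    pos, vel = state[..., :1], state[..., 1:]
    vel = vel + dt * action
    pos = pos + dt * vel
    return torch.cat([pos, vel], dim=-1)

for _ in range(3):                        # a few APG updates
    state = torch.tensor([[1.0, 0.0]])    # start away from the goal
    ret = 0.0
    for t in range(20):                   # rollout stays in the autograd graph
        state = step(state, policy(state))
        ret = ret - (state ** 2).sum()    # reward: reach the origin
    (-ret).backward()                     # analytic gradient of the return
    opt.step(); opt.zero_grad()
```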
arXiv Detail & Related papers (2024-09-12T11:50:06Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can outperform the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
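A hedged sketch of the action-quantization idea, using plain k-means as a simple stand-in for the paper's adaptive scheme: continuous dataset actions are mapped to a small codebook so a discrete-action offline RL method can run on top. Dataset shapes are illustrative.

```python
# Sketch: discretize continuous dataset actions via a learned codebook.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
actions = rng.normal(size=(10_000, 6))          # continuous dataset actions

# Learn a codebook from the data distribution itself.
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(actions)

discrete_actions = codebook.predict(actions)    # one token per transition
print(discrete_actions[:5])

# At execution time, a discrete policy picks a code, which is decoded back
# to a continuous action via its centroid.
continuous_action = codebook.cluster_centers_[7]
```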
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions [143.89572689302497]
We present a scalable reinforcement learning method for training multi-task policies from large offline datasets.
Our method uses a Transformer to provide a scalable representation for Q-functions trained via offline temporal difference backups.
We show that Q-Transformer outperforms prior offline RL algorithms and imitation learning techniques on a large diverse real-world robotic manipulation task suite.
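A hedged sketch of the autoregressive Q idea: each action dimension is discretized into bins and the action is decoded greedily one dimension at a time. A tiny MLP stands in for the Transformer; shapes and bin counts are illustrative assumptions.

```python
# Sketch: per-dimension Q over discretized actions, maximized autoregressively.
import torch
import torch.nn as nn

S_DIM, A_DIMS, N_BINS = 8, 3, 16

class PerDimQ(nn.Module):
    """Q over the bins of one action dimension, given the state, the
    already-chosen earlier dimensions, and a one-hot dimension marker."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(S_DIM + 2 * A_DIMS, 128), nn.ReLU(), nn.Linear(128, N_BINS)
        )
    def forward(self, state, prev_bins, dim_onehot):
        return self.net(torch.cat([state, prev_bins, dim_onehot], dim=-1))

q = PerDimQ()

def greedy_action(state):
    """Maximize Q one action dimension at a time (autoregressive argmax)."""
    prev, bins = torch.zeros(A_DIMS), []
    with torch.no_grad():
        for d in range(A_DIMS):
            b = q(state, prev, torch.eye(A_DIMS)[d]).argmax().item()
            bins.append(b)
            prev[d] = b / (N_BINS - 1)    # expose the choice to later dims
    return bins                           # one bin index per dimension

print(greedy_action(torch.randn(S_DIM)))
```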
arXiv Detail & Related papers (2023-09-18T21:00:38Z)
- Hypernetworks for Zero-shot Transfer in Reinforcement Learning [21.994654567458017]
Hypernetworks are trained to generate behaviors across a range of unseen task conditions.
This work relates to meta RL, contextual RL, and transfer learning.
Our method demonstrates significant improvements over baselines from multitask and meta RL approaches.
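A minimal sketch of the hypernetwork idea, assuming a simple MLP policy: a network maps a task-context vector to the full weight vector of the policy, so an unseen context yields a new policy zero-shot. All sizes are illustrative assumptions.

```python
# Sketch: a hypernetwork generates the weights of a small policy from context.
import torch
import torch.nn as nn

CTX, S_DIM, H, A_DIM = 6, 8, 32, 2
n_params = (S_DIM * H + H) + (H * A_DIM + A_DIM)   # 2-layer policy weights

hyper = nn.Sequential(nn.Linear(CTX, 128), nn.ReLU(), nn.Linear(128, n_params))

def policy(state, context):
    """Run the policy whose weights are generated from the task context."""
    p, i = hyper(context), 0
    w1 = p[i:i + S_DIM * H].view(H, S_DIM); i += S_DIM * H
    b1 = p[i:i + H]; i += H
    w2 = p[i:i + H * A_DIM].view(A_DIM, H); i += H * A_DIM
    b2 = p[i:]
    h = torch.tanh(state @ w1.T + b1)
    return h @ w2.T + b2

print(policy(torch.randn(S_DIM), torch.randn(CTX)))
```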
arXiv Detail & Related papers (2022-11-28T15:48:35Z)
- Robot Learning of Mobile Manipulation with Reachability Behavior Priors [38.49783454634775]
Mobile Manipulation (MM) systems are ideal candidates for taking up the role of a personal assistant in unstructured real-world environments.
Among other challenges, MM requires effective coordination of the robot's embodiments for executing tasks that require both mobility and manipulation.
We study the integration of robotic reachability priors in actor-critic RL methods for accelerating the learning of MM for reaching and fetching tasks.
arXiv Detail & Related papers (2022-03-08T12:44:42Z)
- Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior.
The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
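A hedged sketch of the retrieval step: embed the current state, pull the k nearest past experiences from the dataset, and hand them to the agent as extra context. The random embeddings and payloads below are stand-ins for learned ones.

```python
# Sketch: nearest-neighbor retrieval of past experience for the agent.
import numpy as np

rng = np.random.default_rng(0)
dataset_embs = rng.normal(size=(50_000, 64))   # embeddings of past experience
dataset_info = rng.normal(size=(50_000, 32))   # payload per experience

def retrieve(query_emb, k=8):
    """Top-k nearest past experiences by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    d = dataset_embs / np.linalg.norm(dataset_embs, axis=1, keepdims=True)
    idx = np.argsort(-(d @ q))[:k]
    return dataset_info[idx]                   # context fed to the agent

context = retrieve(rng.normal(size=64))
print(context.shape)
```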
arXiv Detail & Related papers (2022-02-17T02:44:05Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
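A minimal sketch of the core trick (expectile regression, as in IQL): the value function is fit to Q at dataset actions only, so no out-of-dataset action is ever evaluated. Networks and batch data are random stand-ins.

```python
# Sketch: IQL-style losses that never query Q at unseen actions.
import torch
import torch.nn as nn

S, A, TAU, GAMMA = 8, 2, 0.9, 0.99
q_net = nn.Sequential(nn.Linear(S + A, 64), nn.ReLU(), nn.Linear(64, 1))
v_net = nn.Sequential(nn.Linear(S, 64), nn.ReLU(), nn.Linear(64, 1))

def expectile_loss(u, tau=TAU):
    """|tau - 1(u < 0)| * u^2 -- asymmetric squared error."""
    return (torch.where(u < 0, 1 - tau, tau) * u ** 2).mean()

# One toy update on a random batch of dataset transitions (s, a, r, s').
s, a = torch.randn(32, S), torch.randn(32, A)
r, s2 = torch.randn(32, 1), torch.randn(32, S)

# V regresses toward Q at *dataset* actions via the expectile loss.
v_loss = expectile_loss(q_net(torch.cat([s, a], -1)).detach() - v_net(s))
# Q uses a TD target built from V(s'), so no action maximization is needed.
q_loss = ((r + GAMMA * v_net(s2).detach()
           - q_net(torch.cat([s, a], -1))) ** 2).mean()
print(v_loss.item(), q_loss.item())
```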
arXiv Detail & Related papers (2021-10-12T17:05:05Z)