Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference
- URL: http://arxiv.org/abs/2601.22701v1
- Date: Fri, 30 Jan 2026 08:22:18 GMT
- Title: Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference
- Authors: Emilien Biré, María Santos, Kai Yuan
- Abstract summary: Vision-Language Models (VLMs) have become powerful backbones for agents that autonomously operate in digital environments. However, these models adapt poorly to fast-changing environments such as the web. We introduce a novel paradigm for enhancing agentic VLM policies at inference time, without policy retraining.
- Score: 4.943575742796223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) have become powerful backbones for agents that autonomously operate in digital environments such as the web and operating systems. However, these models adapt poorly to fast-changing environments like the web, and mitigating this through fine-tuning requires expensive model training and data collection. In this work, we introduce a novel paradigm for enhancing agentic VLM policies at inference time, without policy retraining. Fundamentally, our approach decouples the VLM's role as a high-capacity action proposer from the final action selection mechanism. We keep the VLM policy frozen and use it to generate a set of candidate actions for a given state. A lightweight, offline-trained Q-function then reranks these candidates, and the agent executes the action with the highest estimated value. Our main contribution is to apply the Q-function directly during inference for immediate policy improvement, rather than offline to relabel data for policy retraining. On the academic WebVoyager benchmark, our method significantly boosts agent success rates, improving a Qwen2.5-VL-7B agent from 38.8% to 55.7% and a proprietary GPT-4.1 agent from 82.4% to 88.8%.
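The selection loop described in the abstract is simple enough to sketch. Below is a minimal illustration of the propose-then-rerank pattern; `propose_actions` and `q_function` are hypothetical stand-ins, not the authors' API.

```python
# Minimal sketch of Best-of-Q action selection at inference time.
# propose_actions and q_function are hypothetical stand-ins for the frozen
# VLM policy and the offline-trained Q-function described in the abstract.
from typing import Callable, List


def best_of_q_step(
    state: dict,
    propose_actions: Callable[[dict, int], List[str]],  # frozen VLM proposer
    q_function: Callable[[dict, str], float],           # lightweight Q-function
    num_candidates: int = 8,
) -> str:
    """Sample candidate actions from the frozen VLM and return the best one."""
    candidates = propose_actions(state, num_candidates)
    # Rerank candidates by estimated value; the policy itself is never updated.
    return max(candidates, key=lambda a: q_function(state, a))
```

The design keeps the expensive component, the VLM, frozen; only the lightweight Q-function needs offline training.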
Related papers
- Demonstration-Free Robotic Control via LLM Agents [0.0]
We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. With privileged access to environment state, FAEA achieves success rates of 84.9%, 85.7%, and 96%. Our results indicate that general-purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task-level planning.
arXiv Detail & Related papers (2026-01-28T07:49:35Z)
- Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents [40.73340280747757]
The ReAct capability in large language models (LLMs) has become the foundation of modern agentic systems. We introduce Pre-Act, a novel approach that enhances an agent's performance by creating a multi-step execution plan. Our approach applies to both conversational and non-conversational agents.
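As a rough illustration of the plan-then-act idea only; this is not Pre-Act's actual prompting or interfaces, and `llm` and `act` are hypothetical stand-ins:

```python
# Toy plan-then-act loop: draft a multi-step plan up front, then execute one
# step at a time while accumulating observations. llm() and act() are
# hypothetical stand-ins, not Pre-Act's actual interfaces.
def plan_then_act(task: str, llm, act, max_steps: int = 10) -> list:
    plan = llm(f"Write a numbered multi-step plan for the task: {task}")
    history = []
    for _ in range(max_steps):
        step = llm(f"Plan:\n{plan}\nHistory so far:\n{history}\n"
                   f"Reply with the next action, or DONE if finished.")
        if step.strip() == "DONE":
            break
        history.append((step, act(step)))  # record action and its observation
    return history
```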
arXiv Detail & Related papers (2025-05-15T05:17:47Z)
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage. We show that scaling the agentic reasoning system at test time substantially enhances robustness without compromising model utility. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z)
- WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model [55.276852838877346]
Self-evolving agents are trained on trajectories sampled autonomously from their own policies. We propose a novel framework that introduces a co-evolving world-model LLM. This world model predicts the next observation from the current observation and action within the web environment.
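One-step lookahead with such a world model could be sketched as follows; `predict_next_observation` and `score_observation` are hypothetical wrappers, not WebEvolver's interface.

```python
# Hedged sketch: use a learned world model for one-step lookahead on the web.
# predict_next_observation and score_observation are hypothetical wrappers
# around the world-model LLM and a value heuristic, not WebEvolver's API.
def pick_action_with_world_model(observation, candidate_actions,
                                 predict_next_observation, score_observation):
    """Imagine the page each action would lead to and keep the best action."""
    def imagined_value(action):
        next_obs = predict_next_observation(observation, action)
        return score_observation(next_obs)
    return max(candidate_actions, key=imagined_value)
```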
arXiv Detail & Related papers (2025-04-23T02:54:31Z)
- Digi-Q: Learning Q-Value Functions for Training Device-Control Agents [73.60512136881279]
Digi-Q trains VLM-based action-value Q-functions, which are then used to extract the agent policy. Digi-Q outperforms several prior methods on user-scale device-control tasks in Android-in-the-Wild.
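To make "training a Q-function from offline trajectories" concrete, here is a generic one-step temporal-difference loss on logged transitions; this is a textbook sketch, not Digi-Q's exact recipe.

```python
# Generic SARSA-style TD(0) loss on logged transitions, in the spirit of
# offline Q-function training; a textbook sketch, not Digi-Q's exact recipe.
import torch
import torch.nn.functional as F


def td_loss(q_net, target_q_net, batch, gamma: float = 0.99) -> torch.Tensor:
    """Regress Q(s, a) toward r + gamma * Q_target(s', a') on logged data.

    batch holds tensors; batch["done"] is a float mask (1.0 at terminal states).
    """
    q = q_net(batch["state"], batch["action"])
    with torch.no_grad():
        # Bootstrapping from the *logged* next action avoids querying
        # out-of-dataset actions during training.
        next_q = target_q_net(batch["next_state"], batch["next_action"])
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q
    return F.mse_loss(q, target)
```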
arXiv Detail & Related papers (2025-02-13T18:55:14Z)
- Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models [22.43652231336764]
We propose leveraging a task-relevant Q-value model to guide action selection.
We show that step-level Q-value models significantly improve the performance of LLM agents.
arXiv Detail & Related papers (2024-09-14T07:32:49Z)
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [44.34340798542]
Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning.
Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities.
We propose a framework that combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iterative fine-tuning on agent interactions.
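For readers unfamiliar with MCTS, the standard UCT selection rule below is the usual starting point; Agent Q additionally guides the search with model priors and self-critique scores, which this generic sketch omits.

```python
# Standard UCT child selection for MCTS; a generic sketch only. Agent Q
# further guides the search with LLM priors and self-critique scores.
import math


def uct_select(children, parent_visits: int, c: float = 1.4):
    """Pick the child maximizing mean value plus an exploration bonus.

    Each child is assumed to expose .visits and .total_value attributes.
    """
    def uct(child):
        if child.visits == 0:
            return float("inf")  # expand unvisited children first
        exploit = child.total_value / child.visits
        explore = c * math.sqrt(math.log(parent_visits) / child.visits)
        return exploit + explore
    return max(children, key=uct)
```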
arXiv Detail & Related papers (2024-08-13T20:52:13Z)
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning [61.10299147201369]
This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents.
We build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator.
We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement.
arXiv Detail & Related papers (2024-06-14T17:49:55Z)
- DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses a large language model (LLM) agent and case-based reasoning (CBR).
In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on expert knowledge from Kaggle.
In the deployment stage, DS-Agent adopts a simplified CBR paradigm for low-resource settings, significantly reducing the demand on the foundational capabilities of LLMs.
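The "retrieve" step at the heart of CBR can be sketched as nearest-neighbor search over past cases; `embed` is a hypothetical text encoder, and this is not DS-Agent's actual pipeline.

```python
# Hedged sketch of CBR retrieval: find the stored case most similar to the
# new task by cosine similarity. embed() is a hypothetical text encoder.
import numpy as np


def retrieve_case(task_description: str, cases: list, embed) -> str:
    """Return the past case whose embedding is closest to the new task."""
    query = embed(task_description)
    query = query / np.linalg.norm(query)
    case_vecs = np.stack([embed(c) for c in cases])
    case_vecs = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    return cases[int(np.argmax(case_vecs @ query))]
```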
arXiv Detail & Related papers (2024-02-27T12:26:07Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
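The mechanism behind this is expectile regression: a state-value function V is regressed toward Q with an asymmetric squared loss, so the value estimate approaches the best in-dataset action without evaluating unseen ones. A sketch of the published value loss (variable names are mine):

```python
# Expectile-regression value loss from Implicit Q-Learning: an asymmetric
# squared error that upweights positive residuals Q(s,a) - V(s) when tau > 0.5.
import torch


def expectile_loss(q_values: torch.Tensor, v_values: torch.Tensor,
                   tau: float = 0.7) -> torch.Tensor:
    """L = |tau - 1[u < 0]| * u^2, where u = Q(s, a) - V(s)."""
    u = q_values - v_values
    weight = torch.abs(tau - (u < 0).float())
    return (weight * u.pow(2)).mean()
```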
arXiv Detail & Related papers (2021-10-12T17:05:05Z)