Exploration with Foundation Models: Capabilities, Limitations, and Hybrid Approaches
- URL: http://arxiv.org/abs/2509.19924v1
- Date: Wed, 24 Sep 2025 09:25:15 GMT
- Title: Exploration with Foundation Models: Capabilities, Limitations, and Hybrid Approaches
- Authors: Remo Sasso, Michelangelo Conserva, Dominik Jeurissen, Paulo Rauber,
- Abstract summary: We show that VLM guidance can significantly improve early-stage sample efficiency. Our results provide a clear analysis of the potential and constraints of using foundation models to guide exploration rather than for end-to-end control.
- Score: 2.9165586612027234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploration in reinforcement learning (RL) remains challenging, particularly in sparse-reward settings. While foundation models possess strong semantic priors, their capabilities as zero-shot exploration agents in classic RL benchmarks are not well understood. We benchmark LLMs and VLMs on multi-armed bandits, Gridworlds, and sparse-reward Atari to test zero-shot exploration. Our investigation reveals a key limitation: while VLMs can infer high-level objectives from visual input, they consistently fail at precise low-level control: the "knowing-doing gap". To analyze a potential bridge for this gap, we investigate a simple on-policy hybrid framework in a controlled, best-case scenario. Our results in this idealized setting show that VLM guidance can significantly improve early-stage sample efficiency, providing a clear analysis of the potential and constraints of using foundation models to guide exploration rather than for end-to-end control.
Related papers
- Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning [18.215893951726166]
In environments with sparse or delayed rewards, reinforcement learning incurs high sample complexity. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent's own successful rollouts.
arXiv Detail & Related papers (2026-02-20T01:44:35Z)
- Found-RL: foundation model-enhanced reinforcement learning for autonomous driving [15.275134927543611]
Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD). Found-RL is a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is the asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop.
arXiv Detail & Related papers (2026-02-11T02:56:04Z)
- Contamination Detection for VLMs using Multi-Modal Semantic Perturbation [73.76465227729818]
Open-source Vision-Language Models (VLMs) have achieved state-of-the-art performance on benchmark tasks. Pretraining corpora raise a critical concern for both practitioners and users: inflated performance due to test-set leakage. We show that existing detection approaches either fail outright or exhibit inconsistent behavior. We propose a novel simple yet effective detection method based on multi-modal semantic perturbation.
arXiv Detail & Related papers (2025-11-05T18:59:52Z)
- Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries [23.825984868116716]
We introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning. We leverage this controllable environment to train Vision-Language Models (VLMs) using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%.
arXiv Detail & Related papers (2025-11-01T21:19:41Z)
- Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning [49.290631188365786]
Scaf-GRPO is a training framework that intervenes when a model's independent learning has plateaued. It boosts the pass@1 score of the Qwen2.5-Math-7B model by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach.
arXiv Detail & Related papers (2025-10-22T17:41:30Z)
- A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA [65.38186593873313]
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. We introduce a proof-of-concept multi-call framework for MHQA, InfoQA. We construct a stringent and noise-rich benchmark to validate our theory and framework.
arXiv Detail & Related papers (2025-09-25T14:11:57Z)
- Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [54.70676039314542]
We present the first systematic study on quantizing diffusion-based language models. We identify the presence of activation outliers, characterized by abnormally large activation values. We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants.
arXiv Detail & Related papers (2025-08-20T17:59:51Z)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization [45.799380822683034]
We present an extensive study aimed at advancing RL-based finetuning techniques for Large Language Models (LLMs). We highlight key limitations of commonly adopted LLMs, such as their tendency to over-predict certain types of vulnerabilities while failing to detect others. To address this challenge, we explore the use of Group Relative Policy Optimization (GRPO), a recent policy-gradient method, for guiding LLM behavior through structured, rule-based rewards.
arXiv Detail & Related papers (2025-07-03T11:52:45Z)
- Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood. We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- An Empirical Study of Automated Vulnerability Localization with Large Language Models [21.84971967029474]
Large Language Models (LLMs) have shown potential in various domains, yet their effectiveness in vulnerability localization remains underexplored.
Our investigation encompasses 10+ leading LLMs suitable for code analysis, including ChatGPT and various open-source models.
We explore the efficacy of these LLMs using 4 distinct paradigms: zero-shot learning, one-shot learning, discriminative fine-tuning, and generative fine-tuning.
arXiv Detail & Related papers (2024-03-30T08:42:10Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.