Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries
- URL: http://arxiv.org/abs/2511.00710v1
- Date: Sat, 01 Nov 2025 21:19:41 GMT
- Title: Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries
- Authors: Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu
- Abstract summary: We introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning. We leverage this controllable environment to train Vision-Language Models (VLMs) using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%.
- Score: 23.825984868116716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) show impressive general reasoning, their evaluation is often confined to language-dominant tasks (e.g., math). This raises a critical question: can RL post-training truly extend the inherent capability boundary of a base VLM, particularly for visual-centric spatial tasks where it initially fails? To investigate this, we introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning where task difficulty (e.g., path length, turns) is precisely controlled. We leverage this controllable environment to train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%, demonstrating that our approach expands the model's initial capability boundary. To assess real-world viability, we evaluate out-of-distribution (OOD) generalization on practical benchmarks. Despite training only on synthetic maze samples, Ariadne achieves significant zero-shot improvements, averaging 16% on MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer tasks). These results confirm that our method not only broadens the model's fundamental limits but also enhances its generalization to real-world spatial reasoning. We acknowledge our study is limited to the post-training phase, given the opaqueness of pre-training data, and hope our research motivates further work on specialized, capability-extending alignment.
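To make the setup concrete: the framework pairs a difficulty-controlled maze generator with a reward that can be checked programmatically, so no learned judge is needed. The Python sketch below is a minimal illustration under our own assumptions; the function names and the linear difficulty schedule are hypothetical, not the authors' implementation.

```python
# Sketch of an RLVR-style verified reward plus a difficulty-aware curriculum
# for maze navigation. All names are illustrative, not the paper's code.
import random

def generate_maze(path_length: int, num_turns: int) -> dict:
    """Stand-in for a controllable maze generator: difficulty is pinned
    down by the ground-truth path length and number of turns."""
    # A real generator would emit a maze image plus its solution; here we
    # fabricate a placeholder move sequence of the requested length.
    moves = ["up", "down", "left", "right"]
    solution = [random.choice(moves) for _ in range(path_length)]
    return {"path_length": path_length, "turns": num_turns, "solution": solution}

def verified_reward(predicted_moves: list[str], maze: dict) -> float:
    """Verified reward: binary and programmatically checkable."""
    return 1.0 if predicted_moves == maze["solution"] else 0.0

def curriculum_sampler(step: int, total_steps: int, max_len: int = 12) -> dict:
    """Difficulty-aware schedule: path length grows with training progress."""
    frac = step / max(total_steps, 1)
    length = max(2, int(frac * max_len))
    return generate_maze(path_length=length, num_turns=length // 2)
```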
Related papers
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards [51.45138356629732]
We introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks.
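The mechanism lends itself to a short sketch: a sparse outcome reward plus a dense, checkable grounding reward. Everything below (function names, the overlap metric, the `alpha` weight) is our illustrative assumption, not the paper's implementation.

```python
# Hedged sketch of mixing a sparse answer reward with a dense, verifiable
# context reward, in the spirit described above.
def answer_reward(pred_answer: str, gold_answer: str) -> float:
    """Sparse outcome signal: 1.0 only for an exact answer match."""
    return 1.0 if pred_answer.strip() == gold_answer.strip() else 0.0

def context_reward(cited: set[str], gold_evidence: set[str]) -> float:
    """Dense grounding signal: overlap with annotated evidence passages."""
    if not gold_evidence:
        return 0.0
    return len(cited & gold_evidence) / len(gold_evidence)

def total_reward(pred: str, gold: str, cited: set[str],
                 gold_evidence: set[str], alpha: float = 0.5) -> float:
    """Weighted mix; alpha is a made-up hyperparameter."""
    return answer_reward(pred, gold) + alpha * context_reward(cited, gold_evidence)
```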
arXiv Detail & Related papers (2026-03-02T18:07:53Z)
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models [73.10315509190623]
Recent reinforcement learning techniques have yielded impressive reasoning improvements in language models. It remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. We develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training.
arXiv Detail & Related papers (2025-12-08T18:12:10Z)
- Do Math Reasoning LLMs Help Predict the Impact of Public Transit Events? [6.428337528749318]
We introduce a tolerance-based, shaped reward function that grants partial credit within a continuous error margin, rather than demanding a single correct answer. Our findings show that general-purpose, instruction-tuned LLMs significantly outperform specialized math-reasoning models. This demonstrates that RLVR can be successfully adapted to real-world, noisy forecasting, but requires a verifier design that reflects the continuous nature of the problem.
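The verifier idea is easy to picture in code: full credit at zero error, partial credit inside the margin, nothing beyond it. The linear decay below is our assumption; the paper may shape the reward differently.

```python
# Sketch of a tolerance-based shaped reward for continuous forecasts.
def shaped_reward(pred: float, target: float, tolerance: float) -> float:
    """1.0 at zero error, decaying linearly to 0.0 at the tolerance edge."""
    error = abs(pred - target)
    return max(0.0, 1.0 - error / tolerance)

# Predicting a 12-minute delay when the true delay is 10 minutes, with a
# 5-minute tolerance, earns 0.6 instead of a flat 0 for an inexact answer.
assert shaped_reward(12.0, 10.0, 5.0) == 0.6
```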
arXiv Detail & Related papers (2025-11-02T05:21:33Z)
- Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning [93.19037653970622]
We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
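One way such label-free verifiable signals can be derived (our illustrative example, not necessarily one of the paper's pretext tasks) is to apply a known transform to an image and ask the model to recover it:

```python
# Self-supervised, verifiable pretext task from an ordinary image:
# the rotation angle is a free label that a checker can verify exactly.
import random
import numpy as np

def make_rotation_task(image: np.ndarray) -> tuple[np.ndarray, int]:
    """Rotate by a random multiple of 90 degrees; return image and label."""
    k = random.randint(0, 3)
    return np.rot90(image, k), 90 * k

def verify(predicted_angle: int, true_angle: int) -> float:
    """RLVR-style binary reward: exact angle or nothing."""
    return 1.0 if predicted_angle % 360 == true_angle else 0.0
```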
arXiv Detail & Related papers (2025-10-31T16:30:08Z)
- Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs [72.08224879435762]
Learn-to-Ask is a simulator-free framework for learning and deploying proactive dialogue agents. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service.
arXiv Detail & Related papers (2025-10-29T12:08:07Z)
- Exploration with Foundation Models: Capabilities, Limitations, and Hybrid Approaches [2.9165586612027234]
We show that VLM guidance can significantly improve early-stage sample efficiency. Our results provide a clear analysis of the potential and constraints of using foundation models to guide exploration rather than for end-to-end control.
arXiv Detail & Related papers (2025-09-24T09:25:15Z)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [111.1749164063616]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective [82.24301452333577]
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains.
arXiv Detail & Related papers (2025-06-17T20:24:00Z)
- Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning [58.62311540316617]
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). We propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B).
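A minimal version of such an easy-to-hard schedule, under our own assumption of a linear drift across difficulty buckets (the paper's exact schedule may differ):

```python
# Easy-to-hard (E2H) task scheduling sketch: sampling moves from the
# easiest bucket toward the hardest as training progresses.
import random

def e2h_sample(buckets: list[list], step: int, total_steps: int):
    """buckets[0] holds the easiest tasks, buckets[-1] the hardest."""
    frac = min(step / max(total_steps, 1), 1.0)
    # Pick the bucket at the current target difficulty.
    idx = min(int(frac * len(buckets)), len(buckets) - 1)
    return random.choice(buckets[idx])
```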
arXiv Detail & Related papers (2025-06-07T02:41:54Z)
- Decomposing Elements of Problem Solving: What "Math" Does RL Teach? [22.517954679764244]
We decompose problem solving into fundamental capabilities: Plan, Execute, and Verify. We show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers.
arXiv Detail & Related papers (2025-05-28T18:18:49Z)
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [67.87579664988199]
TON is a two-stage training strategy for vision-language models (VLMs). It introduces a think-or-not format that serves as a cold start for selective reasoning. TON can reduce the completion length by up to 90% compared to vanilla GRPO.
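The think-or-not format can be pictured as an output protocol in which the reasoning block is allowed to be empty, which is what shortens completions. The tag names below are our assumption, not necessarily TON's exact format:

```python
# Parsing a "think-or-not" completion: reasoning is optional.
import re

def parse_think_or_not(completion: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning may be empty when skipped."""
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else completion.strip()
    return reasoning, final

# A skipped-thought completion: the reasoning block is deliberately empty.
print(parse_think_or_not("<think></think><answer>4</answer>"))  # ('', '4')
```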
arXiv Detail & Related papers (2025-05-22T16:13:29Z)
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model [29.524164786422368]
Recently, DeepSeek R1 has shown that reinforcement learning can substantially improve the reasoning capabilities of Large Language Models (LLMs). We investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs). We develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks.
arXiv Detail & Related papers (2025-04-10T10:05:15Z)
- Offline RLAIF: Piloting VLM Feedback for RL via SFO [4.391505380846452]
Vision-Language Models (VLMs) are limited in their ability to solve control tasks due to their lack of action-conditioned training data. A key challenge in Reinforcement Learning from AI Feedback is determining how best to integrate VLM-derived signals into the learning process.
arXiv Detail & Related papers (2025-03-02T23:52:46Z)
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
- Bootstrapping Reinforcement Learning with Imitation for Vision-Based Agile Flight [20.92646531472541]
We propose a novel approach that combines the performance of Reinforcement Learning (RL) and the sample efficiency of Imitation Learning (IL).
Our framework contains three phases: training a teacher policy using RL with privileged state information, distilling it into a student policy via IL, and adaptive fine-tuning via RL.
Tests show our approach can not only learn in scenarios where RL from scratch fails but also outperforms existing IL methods in both robustness and performance.
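The three-phase recipe reads naturally as a pipeline skeleton. Only the phase ordering below comes from the summary; every interface (`privileged_rollout`, `imitate`, and so on) is a placeholder we invented for illustration:

```python
# Skeleton of the three-phase teacher-student recipe; bodies are placeholders.
def train_pipeline(env, teacher, student, steps: int = 1000):
    # Phase 1: RL teacher on privileged state (e.g., exact poses unavailable
    # to the onboard camera).
    for _ in range(steps):
        teacher.update(env.privileged_rollout(teacher))
    # Phase 2: imitation learning -- the student regresses the teacher's
    # actions from vision-only observations.
    for obs, act in env.paired_rollouts(teacher):
        student.imitate(obs, act)
    # Phase 3: adaptive RL fine-tuning of the student on its own experience.
    for _ in range(steps):
        student.update(env.vision_rollout(student))
    return student
```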
arXiv Detail & Related papers (2024-03-18T19:25:57Z)