SPICE: Self-Play In Corpus Environments Improves Reasoning
- URL: http://arxiv.org/abs/2510.24684v1
- Date: Tue, 28 Oct 2025 17:46:16 GMT
- Title: SPICE: Self-Play In Corpus Environments Improves Reasoning
- Authors: Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston
- Abstract summary: SPICE is a reinforcement learning framework where a single model acts in two roles. A Challenger mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner solves them. Our analysis shows that document grounding is a key ingredient that lets SPICE continuously generate its own increasingly challenging goals.
- Score: 58.78992702325821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.
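The two-role loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the paper's implementation: the function names, the exact-match answer check, the choice to give the Reasoner only the question, and in particular the Challenger reward `4 * p * (1 - p)`, which is just one simple way to favor tasks at the frontier of the Reasoner's capability (it peaks when the Reasoner succeeds about half the time). The abstract does not specify the actual reward.

```python
import itertools
from typing import Callable, Dict, Tuple


def challenger_reward(pass_rate: float) -> float:
    """Hypothetical frontier reward: 1.0 at a 50% pass rate, 0.0 at 0% or 100%."""
    return 4.0 * pass_rate * (1.0 - pass_rate)


def reasoner_reward(answer: str, gold: str) -> float:
    """Hypothetical correctness check: case-insensitive exact match."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0


def self_play_step(
    document: str,
    make_task: Callable[[str], Tuple[str, str]],
    solve: Callable[[str], str],
    num_attempts: int = 8,
) -> Dict[str, object]:
    """One round of the sketch: the same model would play both roles."""
    # Challenger role: read a corpus document, emit a question and its answer.
    question, gold = make_task(document)
    # Reasoner role: attempt the question several times (assumed here to see
    # only the question, not the document).
    attempts = [solve(question) for _ in range(num_attempts)]
    rewards = [reasoner_reward(a, gold) for a in attempts]
    pass_rate = sum(rewards) / len(rewards)
    return {
        "question": question,
        "pass_rate": pass_rate,
        "reasoner_rewards": rewards,
        "challenger_reward": challenger_reward(pass_rate),
    }


# Toy stand-ins for the model: a fixed task and a solver that is right half the time.
answers = itertools.cycle(["Paris", "Lyon"])
result = self_play_step(
    "Paris is the capital of France.",
    make_task=lambda doc: ("What is the capital of France?", "Paris"),
    solve=lambda q: next(answers),
    num_attempts=4,
)
print(result["pass_rate"], result["challenger_reward"])
```

In a real training setup both rewards would feed a policy-gradient update of the shared model. The sketch only shows how a pass-rate-based signal could push the Challenger toward tasks that are neither trivial nor impossible.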
Related papers
- Survival is the Only Reward: Sustainable Self-Training Through Environment-Mediated Selection [0.27087606206363224]
This paper provides a proof-of-concept system architecture for stable self-training under sparse external feedback and bounded memory. We introduce a self-training architecture in which learning is mediated exclusively by environmental viability, rather than by reward, objective functions, or externally defined fitness criteria.
arXiv Detail & Related papers (2026-01-18T08:35:56Z) - AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models [75.214287449744]
We introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We demonstrate through extensive experiments that our model significantly outperforms baselines in predicting failures.
arXiv Detail & Related papers (2025-11-25T13:57:24Z) - Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning [29.2144357080404]
Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs). We develop a novel test-time reward mechanism that operates without external supervision.
arXiv Detail & Related papers (2025-10-20T07:53:51Z) - On the Convergence of Moral Self-Correction in Large Language Models [26.724972162483855]
Large Language Models (LLMs) are able to improve their responses when instructed to do so. LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. We reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions.
arXiv Detail & Related papers (2025-10-08T17:46:27Z) - CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards [53.36917093757101]
Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). We introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment.
arXiv Detail & Related papers (2025-07-23T02:26:33Z) - SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents [58.174206358223415]
Self-Evolving Embodied Agents-R1, or SEEA-R1, is the first reinforcement fine-tuning framework designed for self-evolving embodied agents. We show that SEEA-R1 can support autonomous adaptation and reward-driven self-evolution.
arXiv Detail & Related papers (2025-06-26T18:00:07Z) - Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood. We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z) - Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this.
arXiv Detail & Related papers (2025-05-19T17:59:31Z) - InDRiVE: Intrinsic Disagreement based Reinforcement for Vehicle Exploration through Curiosity Driven Generalized World Model [0.0]
In this paper, we propose InDRiVE (Intrinsic Disagreement based Reinforcement for Vehicle Exploration) as a model-based Reinforcement Learning framework. By training an ensemble of world models, the agent actively explores high uncertainty regions of environments without task specific feedback. Experimental results in both seen and unseen environments demonstrate that InDRiVE achieves higher success rates and fewer infractions compared to DreamerV2 and DreamerV3 baselines.
arXiv Detail & Related papers (2025-03-07T16:56:00Z) - Regularity as Intrinsic Reward for Free Play [24.29379265146469]
We propose regularity as a novel reward signal for intrinsically-motivated reinforcement learning.
Our generalized formulation of Regularity as Intrinsic Reward (RaIR) allows us to operationalize it within model-based reinforcement learning.
arXiv Detail & Related papers (2023-12-03T18:18:44Z) - On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training [109.9218185711916]
Aspect-based sentiment analysis (ABSA) aims at automatically inferring the specific sentiment polarities toward certain aspects of products or services behind social media texts or reviews.
We propose to enhance the ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training.
arXiv Detail & Related papers (2023-04-19T11:07:43Z) - REAL-X -- Robot open-Ended Autonomous Learning Architectures: Achieving Truly End-to-End Sensorimotor Autonomous Learning Systems [0.0]
We study the challenges posed by the previously proposed benchmark "REAL competition".
We present a set of "REAL-X" robot architectures that are able to solve different versions of the benchmark.
arXiv Detail & Related papers (2020-11-27T18:12:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.