Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
- URL: http://arxiv.org/abs/2601.22975v2
- Date: Mon, 02 Feb 2026 21:54:20 GMT
- Title: Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
- Authors: Ximing Lu, David Acuna, Jaehun Jung, Jian Hu, Di Zhang, Shizhe Diao, Yunheng Zou, Shaokun Zhang, Brandon Cui, Mingjie Liu, Hyunwoo Kim, Prithviraj Ammanabrolu, Jan Kautz, Yi Dong, Yejin Choi
- Abstract summary: Golden Goose is a trick to synthesize unlimited RLVR tasks from unverifiable internet text. We leverage reasoning-rich unverifiable corpora to synthesize GooseReason-0.7M, a large-scale RLVR dataset. We also deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain.
- Score: 90.21130870640057
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
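The task construction described in the abstract (mask a key reasoning step, add distractors, pose a multiple-choice question) can be sketched as below. This is a minimal illustration, not the authors' implementation: the names `MCQTask`, `build_task`, and `reward`, the `[MASK]` token, and the `Answer: X` response format are all assumptions made for the example. In the paper an LLM selects the span to mask and writes the distractors; here both are passed in by hand.

```python
import random
import re
from dataclasses import dataclass

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


@dataclass
class MCQTask:
    prompt: str    # masked passage plus lettered options
    options: list  # options in shuffled order
    answer: str    # letter of the correct option


def build_task(text, span, distractors, seed=0):
    """Mask `span` in `text` and mix it with distractors as lettered options."""
    if text.count(span) != 1:
        raise ValueError("span must occur exactly once in the source text")
    if span in distractors:
        raise ValueError("distractors must differ from the masked span")
    if len(distractors) > 7:
        raise ValueError("this sketch supports at most 8 options (A-H)")
    masked = text.replace(span, MASK)
    options = [span] + list(distractors)
    random.Random(seed).shuffle(options)  # seeded so the task is reproducible
    letters = "ABCDEFGH"[: len(options)]
    answer = letters[options.index(span)]
    lines = [masked, "", f"Which option correctly fills {MASK}?"]
    lines += [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    return MCQTask("\n".join(lines), options, answer)


def reward(task, response):
    """Binary verifiable reward: 1.0 if the last 'Answer: X' matches, else 0.0."""
    found = re.findall(r"Answer:\s*([A-H])", response)
    return 1.0 if found and found[-1] == task.answer else 0.0


if __name__ == "__main__":
    passage = (
        "To find the area of the triangle, we first compute the height. "
        "Since the base is 6 and the height is 4, the area is 12."
    )
    task = build_task(
        passage,
        span="the area is 12",
        distractors=["the area is 24", "the area is 10", "the area is 6"],
    )
    print(task.prompt)
    print(reward(task, f"The height is 4, so... Answer: {task.answer}"))
```

Because the correct option is known by construction, the reward needs no external verifier or reference solution, which is what lets arbitrary text serve as a source of checkable tasks.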
Related papers
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards [51.45138356629732]
We introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks.
arXiv Detail & Related papers (2026-03-02T18:07:53Z)
- Detecting RLVR Training Data via Structural Convergence of Reasoning [31.260852555788205]
Reinforcement learning with verifiable rewards (RLVR) is central to training modern reasoning models. We show that RLVR induces a distinctive behavioral signature. We introduce Min-$k$NN Distance, a simple black-box detector that quantifies this collapse.
arXiv Detail & Related papers (2026-02-12T10:17:32Z)
- Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning [82.91265691530351]
A$^2$D is an Adaptive Ability Decomposing method for enhancing the effectiveness of Reinforcement Learning with verifiable rewards. We first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance.
arXiv Detail & Related papers (2026-01-31T14:48:23Z)
- Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards [13.064343544668283]
We propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering". We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500.
arXiv Detail & Related papers (2025-11-21T18:23:04Z)
- CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models [85.315711639214]
We introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. For the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses.
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
- Reinforcement Learning with Rubric Anchors [26.9944158097067]
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs). We extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration.
arXiv Detail & Related papers (2025-08-18T10:06:08Z)
- QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation [51.393569044134445]
Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification. Extending RLVR to automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges. We introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs.
arXiv Detail & Related papers (2025-05-30T03:51:06Z)
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data [57.30662797376754]
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models. We introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability. AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models.
arXiv Detail & Related papers (2025-05-06T09:08:00Z)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.