Related papers: QueST: Incentivizing LLMs to Generate Difficult Problems

URL: http://arxiv.org/abs/2510.17715v1
Date: Mon, 20 Oct 2025 16:29:53 GMT
Title: QueST: Incentivizing LLMs to Generate Difficult Problems
Authors: Hanxu Hu, Xingxing Zhang, Jannis Vamvas, Rico Sennrich, Furu Wei
Abstract summary: Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. Existing competitive coding datasets contain only thousands to tens of thousands of problems. We propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning.
Score: 77.75835742350644
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.

Related papers

LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations [5.275682987885503]
We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks. We show that models encode a model-specific notion of difficulty that is distinct from human difficulty. We demonstrate that routing queries across a pool of models can exceed the best-performing model.
arXiv Detail & Related papers (2026-02-10T15:57:00Z)
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models [54.29243291958429]
We develop a problem generator that reasons explicitly to plan problem directions before synthesis. We treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty. Our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models.
arXiv Detail & Related papers (2025-11-13T03:08:51Z)
UniCode: A Framework for Generating High Quality Competitive Coding Problems [41.66698149759178]
UniCode is a novel framework that automatically generates high-quality algorithmic problems alongside robust, contamination-resistant test cases. We show that UniCode is highly challenging and discriminative, with the top-performing model, o4-mini, achieving a pass rate of only 70.3%.
arXiv Detail & Related papers (2025-10-16T05:07:12Z)
ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning [51.946959481392064]
Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving. We propose ScaleDiff, a pipeline designed to scale the creation of difficult problems. We show that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models.
arXiv Detail & Related papers (2025-09-25T12:22:44Z)
Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding [59.60915947702282]
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). Existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. We propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region.
arXiv Detail & Related papers (2025-09-08T17:36:21Z)
Synthesis by Design: Controlled Data Generation via Structural Guidance [7.938713951512933]
We propose to extract structural information with generated problem-solving code from mathematical reasoning. Our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases.
arXiv Detail & Related papers (2025-06-09T11:38:23Z)
Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset. We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard). We find that progression from Easy to Medium tier requires adopting an R1 reasoning style with minimal SFT-1K instances. Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z)
PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
Evaluating Language Models as Synthetic Data Generators [99.16334775127875]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z)
Training and Evaluating Language Models with Template-based Data Generation [5.980612601840882]
We introduce TDG, a novel paradigm that harnesses frontier LLMs (GPT-4) to automatically generate meta-templates, which in turn synthesize a virtually infinite stream of problems and solutions. Our approach augments data by employing GPT-4 for meta-template creation, guaranteeing diverse and complex problem structures.
arXiv Detail & Related papers (2024-11-27T07:32:56Z)