Related papers: CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation

URL: http://arxiv.org/abs/2602.01660v1
Date: Mon, 02 Feb 2026 05:28:26 GMT
Title: CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation
Authors: Zhongyuan Peng, Caijun Xu, Changyi Xiao, Shibo Hong, Eli Zhang, Stephen Huang, Yixin Cao,
Abstract summary: Large Reasoning Models (LRMs) benefit substantially from training on challenging competition-level questions. Existing automated question synthesis methods lack precise difficulty control, incur high computational costs, and struggle to generate competition-level questions at scale. We propose CoDiQ, a novel framework enabling fine-grained difficulty control via test-time scaling while ensuring question solvability.
Score: 12.550135424877894
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Reasoning Models (LRMs) benefit substantially from training on challenging competition-level questions. However, existing automated question synthesis methods lack precise difficulty control, incur high computational costs, and struggle to generate competition-level questions at scale. In this paper, we propose CoDiQ (Controllable Difficult Question Generation), a novel framework enabling fine-grained difficulty control via test-time scaling while ensuring question solvability. Specifically, first, we identify a test-time scaling tendency (extended reasoning token budget boosts difficulty but reduces solvability) and the intrinsic properties defining the upper bound of a model's ability to generate valid, high-difficulty questions. Then, we develop CoDiQ-Generator from Qwen3-8B, which improves the upper bound of difficult question generation, making it particularly well-suited for challenging question construction. Building on the CoDiQ framework, we build CoDiQ-Corpus (44K competition-grade question sequences). Human evaluations show these questions are significantly more challenging than LiveCodeBench/AIME with over 82% solvability. Training LRMs on CoDiQ-Corpus substantially improves reasoning performance, verifying that scaling controlled-difficulty training questions enhances reasoning capabilities. We open-source CoDiQ-Corpus, CoDiQ-Generator, and implementations to support related research.

Related papers

Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability [129.1296673737603]
Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. We propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity.
arXiv Detail & Related papers (2026-02-02T18:54:54Z)
TTCS: Test-Time Curriculum Synthesis for Self-Evolving [47.826209735956716]
Test-Time Training offers a promising way to improve the reasoning ability of large language models. We propose TTCS, a co-evolving test-time training framework. We show that TTCS consistently strengthens the reasoning ability on challenging mathematical benchmarks.
arXiv Detail & Related papers (2026-01-30T06:38:02Z)
QueST: Incentivizing LLMs to Generate Difficult Problems [77.75835742350644]
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. Existing competitive coding datasets contain only thousands to tens of thousands of problems. We propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning.
arXiv Detail & Related papers (2025-10-20T16:29:53Z)
UQ: Assessing Language Models on Unsolved Questions [149.46593270027697]
We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers.
arXiv Detail & Related papers (2025-08-25T01:07:59Z)
Advancing Question Generation with Joint Narrative and Difficulty Control [0.0]
We propose a strategy for Joint Narrative and Difficulty Control, enabling simultaneous control over these two attributes in the generation of reading comprehension questions. Our evaluation provides preliminary evidence that this approach is feasible, though it is not effective across all instances.
arXiv Detail & Related papers (2025-06-07T14:26:11Z)
Understanding Complexity in VideoQA via Visual Program Generation [31.207902042321006]
We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). We experimentally show that humans struggle to predict which questions are difficult for machine learning models. We extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.
arXiv Detail & Related papers (2025-05-19T17:55:14Z)
PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs [3.24692739098077]
Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning. We evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting. We observe that late interaction models and, surprisingly, lexical models like BM25 perform well compared to other pre-trained dense retrieval models.
arXiv Detail & Related papers (2024-06-24T22:09:50Z)
RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA). We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging. Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
Guiding the Growth: Difficulty-Controllable Question Generation through Step-by-Step Rewriting [30.722526598633912]
We argue that Question Generation (QG) systems should have stronger control over the logic of generated questions. We propose a novel framework that progressively increases question difficulty through step-by-step rewriting.
arXiv Detail & Related papers (2021-05-25T06:43:13Z)
KQA Pro: A Dataset with Explicit Compositional Programs for Complex Question Answering over Knowledge Base [67.87878113432723]
We introduce KQA Pro, a dataset for Complex KBQA including 120K diverse natural language questions. For each question, we provide the corresponding KoPL program and SPARQL query, so that KQA Pro serves for both KBQA and semantic parsing tasks.
arXiv Detail & Related papers (2020-07-08T03:28:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.