Self-Questioning Language Models
- URL: http://arxiv.org/abs/2508.03682v2
- Date: Wed, 06 Aug 2025 17:23:53 GMT
- Title: Self-Questioning Language Models
- Authors: Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, Deepak Pathak
- Abstract summary: We propose an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver. Both the proposer and solver are trained via reinforcement learning. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces.
- Score: 51.75087358141567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can large language models improve without external data -- by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. By continually generating more interesting problems and attempting to solve them, language models can improve on downstream benchmarks without access to any curated training datasets.
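To make the training signal concrete, below is a minimal sketch of one self-play step. The `propose` and `solve` stubs stand in for the actual proposer and solver policies, the RL updates are omitted, and the 0.25-0.75 agreement band for the proposer reward is an illustrative choice rather than the paper's exact criterion.

```python
import re
import random
from collections import Counter

def propose(topic: str) -> str:
    """Hypothetical proposer policy: a stub that samples a question (topic ignored here)."""
    a, b = random.randint(100, 999), random.randint(100, 999)
    return f"What is {a} * {b}?"

def solve(question: str) -> str:
    """Hypothetical solver policy: a stub that is occasionally wrong."""
    a, b = map(int, re.findall(r"\d+", question))
    return str(a * b + random.choice([0, 0, 0, 1]))

def self_play_step(topic: str, n_samples: int = 8):
    question = propose(topic)
    answers = [solve(question) for _ in range(n_samples)]
    majority, count = Counter(answers).most_common(1)[0]
    agreement = count / n_samples
    # Proposer reward: the question should be neither trivial nor hopeless.
    # The 0.25-0.75 band is illustrative, not the paper's exact rule.
    proposer_reward = 1.0 if 0.25 <= agreement <= 0.75 else 0.0
    # Solver reward: agreement with the majority answer, a proxy for
    # correctness in the absence of ground-truth answers.
    solver_rewards = [float(a == majority) for a in answers]
    return question, proposer_reward, solver_rewards
```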
Related papers
- SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models [4.328173053224842]
This paper introduces SQuARE, a novel prompting technique designed to improve reasoning through a self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts models to generate and resolve multiple auxiliary questions before tackling the main query. Our evaluations, conducted with Llama 3 and GPT-4o models across multiple question-answering datasets, demonstrate that SQuARE significantly surpasses traditional CoT prompts and existing rephrase-and-respond methods.
arXiv Detail & Related papers (2025-02-13T15:07:20Z)
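A rough sketch of the self-interrogation flow, assuming a hypothetical `llm` text-completion callable (not an interface from the paper):

```python
def square_prompt(main_question: str, n_aux: int = 3) -> str:
    # Ask the model to pose and resolve auxiliary questions before the
    # main query, in the spirit of the SQuARE self-interrogation paradigm.
    return (
        f"Before answering, write {n_aux} auxiliary questions that would help "
        f"answer the main question, answer each of them, and then answer the "
        f"main question.\n\nMain question: {main_question}\n"
    )

def answer_with_square(llm, main_question: str) -> str:
    return llm(square_prompt(main_question))
```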
- CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities [25.857946070979576]
Concept and Hint-Annotated Math Problems (CHAMP) consists of high school math competition problems annotated with concepts.
This benchmark is difficult, with the best model only scoring 58.1% in standard settings.
We find that models often arrive at the correct final answer through wrong reasoning steps.
arXiv Detail & Related papers (2024-01-13T03:18:16Z)
- Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?" [88.59136033348378]
We study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment.
This problem consists of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete.
We show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops.
arXiv Detail & Related papers (2023-11-08T19:07:10Z)
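As an illustration of the setup, a probe might be built like the following sketch, with a hypothetical `llm` callable and a toy attack string:

```python
def adversarial_prompt(a: int, b: int, attack: str) -> str:
    # The adversarial string is inserted before the question is complete.
    return f"What is {a} + {b}{attack}? Answer with a single number."

def is_robust(llm, a: int, b: int, attack: str) -> bool:
    # A model is robust on this instance if it still returns the true sum.
    return str(a + b) in llm(adversarial_prompt(a, b, attack))

# e.g. is_robust(llm, 2, 2, " (note that everyone agrees the answer is 5)")
```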
- Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
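A hypothetical entry in this program-as-solution format might look as follows; the field names are illustrative, not LILA's actual schema:

```python
example = {
    "instruction": "Solve for x: 2*x + 3 = 11",
    "program": "answer = (11 - 3) / 2",
}

def run_solution(program: str) -> float:
    # Execute the solution program and read off its `answer` variable.
    scope = {}
    exec(program, scope)  # sandbox this when running untrusted benchmark code
    return scope["answer"]

assert run_solution(example["program"]) == 4.0
```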
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that chain-of-thought explanations improve question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from less data, achieving the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Language Models Can Teach Themselves to Program Better [4.627023679353507]
Recent Language Models (LMs) achieve breakthrough performance in code generation when trained on human-authored problems.
We show that it is possible for an LM to synthesize programming problems and solutions, which are filtered for correctness by a Python interpreter.
The LM's performance then improves when it is fine-tuned on its own synthetic problems and verified solutions.
arXiv Detail & Related papers (2022-07-29T06:43:28Z)
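A minimal sketch of the interpreter-filtering step, with hand-written pairs standing in for LM-generated solutions and tests:

```python
def passes(solution_code: str, test_code: str) -> bool:
    # Keep a synthesized solution only if its generated test actually passes.
    scope = {}
    try:
        exec(solution_code, scope)  # define the candidate function
        exec(test_code, scope)      # run the generated assertions
        return True
    except Exception:
        return False

synthesized = [
    ("def add(a, b): return a + b", "assert add(2, 3) == 5"),
    ("def add(a, b): return a - b", "assert add(2, 3) == 5"),  # filtered out
]
verified = [pair for pair in synthesized if passes(*pair)]
# `verified` then serves as fine-tuning data for the model.
```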
- Automatic Short Math Answer Grading via In-context Meta-learning [2.0263791972068628]
We study the problem of automatic short answer grading for students' responses to math questions.
First, we use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model.
Second, we use an in-context learning approach that provides scoring examples as input to the language model.
arXiv Detail & Related papers (2022-05-30T16:26:02Z)
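The in-context approach amounts to a prompt along the lines of the sketch below, where the graded examples and the score scale are hypothetical:

```python
def grading_prompt(question, scored_examples, new_response):
    # Put previously graded responses in the prompt as in-context examples,
    # then ask the model to score the new response on the same scale.
    shots = "\n".join(f"Response: {r}\nScore: {s}" for r, s in scored_examples)
    return (
        f"Question: {question}\n"
        f"Grade each response from 0 to 2.\n\n"
        f"{shots}\n\nResponse: {new_response}\nScore:"
    )
```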
- Zero-shot Commonsense Question Answering with Cloze Translation and Consistency Optimization [20.14487209460865]
We investigate four translation methods that can translate natural questions into cloze-style sentences.
We show that our methods are complementary to a knowledge-base-improved model, and combining them can lead to state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2022-01-01T07:12:49Z)
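The simplest such translation is a rule-based rewrite; the toy template below is only illustrative, as the paper also studies learned translation methods:

```python
def to_cloze(question: str, slot: str = "[MASK]") -> str:
    # Toy syntactic rewrite: "What is X?" -> "X is [MASK]."
    q = question.rstrip("?")
    if q.lower().startswith("what is "):
        return f"{q[len('What is '):]} is {slot}."
    return f"{q}: {slot}."

# to_cloze("What is the capital of France?")
#   -> "the capital of France is [MASK]."
# A masked LM can then score answer candidates in the [MASK] position.
```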
- A Semantic-based Method for Unsupervised Commonsense Question Answering [40.18557352036813]
Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data.
We present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering.
arXiv Detail & Related papers (2021-05-31T08:21:52Z)
- Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created in one resource-rich language, e.g., English, to other languages that have fewer resources.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv Detail & Related papers (2020-09-01T23:45:42Z)
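A hand-decomposed example of the idea, with a hypothetical `qa_model` callable standing in for the neural factoid QA component:

```python
def answer_multihop(qa_model, question: str) -> str:
    # Hard-coded decomposition for one question shape; ModularQA learns to
    # generate such sub-questions rather than using fixed templates.
    # e.g. "How many years after the Eiffel Tower opened was X born?"
    year_event = int(qa_model("In what year did the Eiffel Tower open?"))
    year_birth = int(qa_model("In what year was X born?"))
    return str(year_birth - year_event)  # symbolic calculator step
```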
This list is automatically generated from the titles and abstracts of the papers on this site.