Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems
- URL: http://arxiv.org/abs/2512.01661v1
- Date: Mon, 01 Dec 2025 13:32:59 GMT
- Title: Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems
- Authors: Dengyun Peng, Qiguang Chen, Bofei Liu, Jiannan Guan, Libo Qin, Zheng Yan, Jinhao Liu, Jianshu Zhang, Wanxiang Che
- Abstract summary: We propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability. Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology. Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty.
- Score: 51.62477754641947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensuring LLM reliability requires not only solving complex problems but also recognizing when a problem is unsolvable. Current models often struggle to distinguish objective unsolvability (inherent contradictions in the problem) from subjective capability limitations (problems beyond the model's competence), which leads to hallucinations and overconfidence. To address this, we propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability. Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology: programmatic generation for logic puzzles and a novel "Reverse Construction" method that injects contradictions into valid reasoning chains for mathematics. Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty. Empirical results show that our approach achieves near-perfect unsolvability detection while also improving accuracy on solvable tasks. Crucially, we identify Capability Collapse, demonstrating that explicit exposure to unsolvable data is indispensable for preventing models from becoming systematically overconfident. Our code and data are available at https://github.com/sfasfaffa/unsolvableQA.
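The abstract describes a reward with three components covering accuracy, unsolvability, and difficulty. As a rough illustration only, the sketch below shows one way such components might be combined; the function name, weights, and additive-then-scaled form are illustrative assumptions, not the paper's actual UnsolvableRL implementation.

```python
def composite_reward(solvable: bool, refused: bool, correct: bool,
                     difficulty: float,
                     w_acc: float = 1.0, w_unsolv: float = 1.0,
                     w_diff: float = 0.5) -> float:
    """Toy composite reward: rewards correct answers on solvable items,
    rewards refusal on unsolvable items, and scales by difficulty.
    Weights and the combination rule are illustrative assumptions."""
    if solvable:
        # Reward a correct answer; refusing a solvable problem earns nothing,
        # which discourages blanket over-refusal.
        base = w_acc if (correct and not refused) else 0.0
    else:
        # Reward flagging an unsolvable problem instead of fabricating an answer.
        base = w_unsolv if refused else 0.0
    # Harder instances contribute a larger reward signal.
    return base * (1.0 + w_diff * difficulty)
```

Under this sketch, refusing a solvable problem is scored the same as answering it incorrectly, which is one plausible way to penalize the overconfidence-versus-overcaution trade-off the abstract highlights.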
Related papers
- Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks [8.210112631285666]
Large language models (LLMs) have demonstrated strong performance on formal language tasks. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages. We show that models achieve perfect accuracy on factual questions and 84-90% on seen tasks, but accuracy drops sharply on unseen problems.
arXiv Detail & Related papers (2026-01-19T21:00:31Z) - Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction [57.67217258741752]
Thinker is a hierarchical thinking model for deep search through multi-turn interaction. It decomposes complex problems into independently solvable sub-problems, and dependencies between sub-problems are passed as parameters via logical functions.
arXiv Detail & Related papers (2025-11-11T07:48:45Z) - ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models [70.33764118171463]
Large Language Models (LLMs) tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability. We develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems. LLMs fail to directly identify unsolvable problems and always generate fabricated responses.
arXiv Detail & Related papers (2025-07-03T19:19:44Z) - FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research [0.6286531904189063]
Approaches to scaling AI supervision include debate, critique, and prover-verifier games. We present FindTheFlaws, a group of five diverse datasets spanning medicine, mathematics, science, coding, and the Lojban language. We evaluate frontier models' critiquing capabilities and observe a range of performance that can be leveraged for scalable oversight experiments.
arXiv Detail & Related papers (2025-03-29T06:38:30Z) - EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges [17.056693711040747]
We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events. This dataset probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning. The benchmark comprises 1184 puzzles of varying complexity requiring teams of skilled solvers hours to days to complete.
arXiv Detail & Related papers (2025-02-13T00:18:34Z) - VCSearch: Bridging the Gap Between Well-Defined and Ill-Defined Problems in Mathematical Reasoning [26.61758685366347]
We develop a benchmark called Problems with Missing and Contradictory conditions (PMC) containing over 5,000 validated ill-defined mathematical problems. VCSEARCH improves the accuracy of identifying unsolvable problems by at least 12% across different large language models.
arXiv Detail & Related papers (2024-06-07T16:24:12Z) - Learning Task Decomposition to Assist Humans in Competitive Programming [90.4846613669734]
We introduce a novel objective for learning task decomposition, termed assistive value (AssistV). We collect a dataset of human repair experiences on different decomposed solutions. In a 177-hour human study, our method enables non-experts to solve 33.3% more problems, speeds them up by 3.3x, and empowers them to match unassisted experts.
arXiv Detail & Related papers (2024-06-07T03:27:51Z) - MacGyver: Are Large Language Models Creative Problem Solvers? [87.70522322728581]
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. We create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems. We present our collection to both LLMs and humans to compare and contrast their problem-solving abilities.
arXiv Detail & Related papers (2023-11-16T08:52:27Z) - Nonuniqueness and Convergence to Equivalent Solutions in Observer-based Inverse Reinforcement Learning [0.8943924354248618]
A key challenge in solving the deterministic inverse reinforcement learning (IRL) problem online and in real time is the existence of multiple solutions.
This nonuniqueness necessitates the study of the notion of equivalent solutions.
A regularized history stack observer that converges to approximately equivalent solutions of the IRL problem is developed.
arXiv Detail & Related papers (2022-10-28T17:52:18Z) - A Mutual Information Maximization Approach for the Spurious Solution Problem in Weakly Supervised Question Answering [60.768146126094955]
Weakly supervised question answering usually has only the final answers as supervision signals.
There may exist many spurious solutions that coincidentally derive the correct answer, but training on such solutions can hurt model performance.
We propose to explicitly exploit such semantic correlations by maximizing the mutual information between question-answer pairs and predicted solutions.
arXiv Detail & Related papers (2021-06-14T05:47:41Z)