SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning
- URL: http://arxiv.org/abs/2506.08989v1
- Date: Tue, 10 Jun 2025 17:02:00 GMT
- Title: SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning
- Authors: Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, Weizhu Chen
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks. We introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. SwS enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models.
- Score: 95.28059121743831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and the limited verifiability of answers in existing distillation-oriented synthetic datasets limit their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model's capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model's weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.
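The abstract describes a loop: flag problems the model consistently fails during RL sampling, extract their core concepts, and synthesize targeted replacements. The Python below is a minimal sketch of that control flow under stated assumptions; `StubModel`, the coin-flip verification, the 25% pass-rate cutoff, and the synthesis helpers are hypothetical stand-ins, not the paper's actual implementation.

```python
import random
from dataclasses import dataclass, field

# Minimal sketch of the SwS weakness-driven loop described in the abstract.
# StubModel, the pass-rate cutoff, and the synthesis helpers are
# hypothetical stand-ins, not the paper's implementation.

@dataclass
class StubModel:
    # concept -> probability that a sampled solution verifies as correct
    skill: dict = field(default_factory=dict)

    def solve_verified(self, problem: dict) -> bool:
        """One rollout plus answer verification, collapsed into a coin flip."""
        return random.random() < self.skill.get(problem["concept"], 0.5)

def identify_weaknesses(model, problems, n_samples=8, max_pass_rate=0.25):
    """Weaknesses = problems the model consistently fails across repeated sampling."""
    weak = []
    for p in problems:
        pass_rate = sum(model.solve_verified(p) for _ in range(n_samples)) / n_samples
        if pass_rate <= max_pass_rate:
            weak.append(p)
    return weak

def extract_core_concepts(weak_problems):
    """The paper extracts concepts from failure cases; here each problem carries a tag."""
    return {p["concept"] for p in weak_problems}

def synthesize_problems(concepts, per_concept=3):
    """Generate new problems targeting weak concepts (stubbed as templates)."""
    return [{"concept": c, "text": f"synthetic {c} problem {i}"}
            for c in concepts for i in range(per_concept)]

if __name__ == "__main__":
    model = StubModel(skill={"geometry": 0.1, "algebra": 0.9})
    pool = [{"concept": "geometry", "text": "q1"}, {"concept": "algebra", "text": "q2"}]
    weak = identify_weaknesses(model, pool)
    new_problems = synthesize_problems(extract_core_concepts(weak))
    # In SwS, the augmented pool feeds the next round of RL training.
    print(len(new_problems), "targeted problems added for the next RL round")
```

In the actual framework, verification comes from the RLVR reward signal and synthesis from a generator model, but the control flow is the same: failures drive what gets generated next.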
Related papers
- Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark [0.0]
The Frame Problem and the Symbol Grounding Problem have historically been viewed as unsolvable within traditional symbolic AI systems. This study investigates whether modern LLMs possess the cognitive capacities required to address these problems.
arXiv Detail & Related papers (2025-06-09T16:12:47Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [82.43575191712726]
We introduce a fine-grained analytic framework to dissect the impact of reinforcement learning on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - SHARP: Synthesizing High-quality Aligned Reasoning Problems for Large Reasoning Models Reinforcement Learning [19.457621121430464]
Training large reasoning models (LRMs) with reinforcement learning in STEM domains is hindered by the scarcity of high-quality, diverse, and verifiable problem sets. We introduce SHARP, a unified approach to Synthesizing High-quality Aligned Reasoning Problems for LRM reinforcement learning with verifiable rewards (RLVR). We implement SHARP by leveraging a state-of-the-art LRM to infer and verify challenging STEM questions, then employ a reinforcement learning loop to refine the model's reasoning through verifiable reward signals.
arXiv Detail & Related papers (2025-05-20T09:54:42Z) - Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this.
arXiv Detail & Related papers (2025-05-19T17:59:31Z) - Self Rewarding Self Improving [0.0]
We demonstrate that large language models can effectively self-improve through self-judging without requiring reference solutions. Our experiments on Countdown puzzles and MIT Integration Bee problems show that models can provide reliable reward signals without ground-truth answers.
arXiv Detail & Related papers (2025-05-12T23:51:04Z) - Impact of Noise on LLM-Models Performance in Abstraction and Reasoning Corpus (ARC) Tasks with Model Temperature Considerations [4.39614901077936]
Large Language Models (LLMs) have generated growing interest in their structured reasoning capabilities. The Abstraction and Reasoning Corpus benchmark plays a crucial role in evaluating these capabilities by testing how well AI models generalize to novel problems. This work underscores the need for developing more robust and adaptable AI systems capable of handling the ambiguity and variability inherent in real-world scenarios.
arXiv Detail & Related papers (2025-04-22T13:43:58Z) - Breaking Focus: Contextual Distraction Curse in Large Language Models [68.4534308805202]
We investigate a critical vulnerability in Large Language Models (LLMs): the Contextual Distraction Vulnerability (CDV). This phenomenon arises when models fail to maintain consistent performance on questions modified with semantically coherent but irrelevant context. We propose an efficient tree-based search methodology to automatically generate CDV examples.
arXiv Detail & Related papers (2025-02-03T18:43:36Z) - Explainable AI-aided Feature Selection and Model Reduction for DRL-based V2X Resource Allocation [18.49800990388549]
Artificial intelligence (AI) is expected to significantly enhance radio resource management (RRM) in sixth-generation (6G) networks. The lack of explainability in complex deep learning (DL) models poses a challenge for practical implementation. This paper proposes a novel explainable AI (XAI)-based framework for feature selection and model complexity reduction.
arXiv Detail & Related papers (2025-01-23T10:55:38Z) - Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing the critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z) - Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch [54.12139707822201]
We propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method. By generating diverse questions from scratch, we produce a dataset of 1 million problem-solution pairs. Our experiments demonstrate that models trained on our data outperform those trained on existing open-source datasets.
arXiv Detail & Related papers (2024-10-24T12:42:04Z) - Sirius: Contextual Sparsity with Correction for Efficient LLMs [17.433112174650514]
Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a higher compression ratio seemingly without quality degradation.
Despite the gap in end-to-end accuracy, we observed that sparse models often share the dense model's general problem-solving logic.
This paper introduces Sirius, an efficient correction mechanism, which significantly recovers CS models quality on reasoning tasks.
arXiv Detail & Related papers (2024-09-05T18:38:07Z) - AttNS: Attention-Inspired Numerical Solving For Limited Data Scenarios [51.94807626839365]
We propose the attention-inspired numerical solver (AttNS) to solve differential equations when data is limited. AttNS is inspired by the effectiveness of attention modules in Residual Neural Networks (ResNet) at enhancing model generalization and robustness.
arXiv Detail & Related papers (2023-02-05T01:39:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.