Learning to Solve Complex Problems via Dataset Decomposition
- URL: http://arxiv.org/abs/2602.20296v1
- Date: Mon, 23 Feb 2026 19:25:40 GMT
- Title: Learning to Solve Complex Problems via Dataset Decomposition
- Authors: Wanru Zhao, Lucas Caccia, Zhengyan Shi, Minseon Kim, Weijia Xu, Alessandro Sordoni,
- Abstract summary: This research explores a reverse curriculum generation approach that decomposes complex datasets into simpler, more learnable components. We propose a teacher-student framework where the teacher is equipped with the ability to reason step-by-step, which is used to generate easier versions of examples.
- Score: 53.1641602054716
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Curriculum learning is a class of training strategies that organizes the data being exposed to a model by difficulty, gradually from simpler to more complex examples. This research explores a reverse curriculum generation approach that recursively decomposes complex datasets into simpler, more learnable components. We propose a teacher-student framework where the teacher is equipped with the ability to reason step-by-step, which is used to recursively generate easier versions of examples, enabling the student model to progressively master difficult tasks. We propose a novel scoring system to measure data difficulty based on its structural complexity and conceptual depth, allowing curriculum construction over decomposed data. Experiments on math datasets (MATH and AIME) and code generation datasets demonstrate that models trained with curricula generated by our approach exhibit superior performance compared to standard training on original datasets.
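The core recipe in the abstract — score each example's difficulty from its structural complexity and conceptual depth, then train easy-to-hard — can be sketched as follows. This is a minimal illustration, not the authors' actual metric: the features (token count as a proxy for structural complexity, nesting depth as a proxy for conceptual depth) and the weights are assumptions chosen for clarity.

```python
# Illustrative sketch of difficulty-scored curriculum construction.
# The scoring features and weights below are hypothetical stand-ins
# for the paper's structural-complexity / conceptual-depth metric.

def max_nesting(text: str) -> int:
    """Deepest parenthesis nesting, a toy proxy for conceptual depth."""
    depth = best = 0
    for ch in text:
        if ch == "(":
            depth += 1
            best = max(best, depth)
        elif ch == ")":
            depth = max(depth - 1, 0)
    return best

def difficulty_score(example: str, w_struct: float = 0.5, w_depth: float = 0.5) -> float:
    """Weighted sum of structural complexity (token count) and depth."""
    structural = len(example.split())
    depth = max_nesting(example)
    return w_struct * structural + w_depth * depth

def build_curriculum(examples):
    """Order training data easiest-first, as in curriculum learning."""
    return sorted(examples, key=difficulty_score)

problems = [
    "((2 + 3) * (4 - 1)) / 5",
    "2 + 2",
    "(7 * 8) - 6",
]
print(build_curriculum(problems))
# easiest expression first, most nested/longest expression last
```

In the paper's full pipeline, the teacher model would additionally generate the easier decomposed variants that populate the early stages of this ordering; here the list is given directly for brevity.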
Related papers
- Bidirectional Curriculum Generation: A Multi-Agent Framework for Data-Efficient Mathematical Reasoning [16.95900718416944]
We introduce a novel Bidirectional Curriculum Generation framework to maximize the instructional value of every training sample.
Unlike rigid trajectories, our multi-agent ecosystem mimics adaptive pedagogy to establish a closed feedback loop.
This mechanism ensures that the model consumes only the most effective data at any given stage.
arXiv Detail & Related papers (2026-03-05T12:49:21Z)
- C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning [78.36259648527401]
C2-Evo is an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities.
We show that C2-Evo consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-07-22T12:27:08Z)
- Divide, Specialize, and Route: A New Approach to Efficient Ensemble Learning [0.0]
We propose Hellsemble, a novel ensemble framework for binary classification.
Hellsemble incrementally partitions the dataset into circles of difficulty.
It achieves strong classification accuracy while maintaining computational efficiency and interpretability.
arXiv Detail & Related papers (2025-06-25T20:26:04Z)
- Constraint Back-translation Improves Complex Instruction Following of Large Language Models [55.60192044049083]
Large language models (LLMs) struggle to follow instructions with complex constraints in format, length, etc.
Previous works conduct post-training on complex instruction-response pairs generated by feeding complex instructions to advanced LLMs.
We propose a novel data generation technique, constraint back-translation.
arXiv Detail & Related papers (2024-10-31T17:42:26Z)
- RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs [12.846097618151951]
We develop a dataset for LLMs Complex Reasoning over Textual Knowledge Graphs (RiTeK) with a broad topological structure coverage.
We synthesize realistic user queries that integrate diverse topological structures, annotated information, and complex textual descriptions.
We introduce an enhanced Monte Carlo Tree Search (MCTS) method, which automatically extracts relational path information from textual graphs for specific queries.
arXiv Detail & Related papers (2024-10-17T19:33:37Z)
- Complexity-Guided Curriculum Learning for Text Graphs [18.746490400989487]
We present a curriculum learning approach that builds on existing knowledge about text and graph complexity formalisms.
A novel data scheduler employs "spaced repetition" and complexity formalisms to guide the training process.
We demonstrate the effectiveness of the proposed approach on several text graph tasks and graph neural network architectures.
arXiv Detail & Related papers (2023-11-22T15:40:57Z)
- Homological Convolutional Neural Networks [4.615338063719135]
We propose a novel deep learning architecture that exploits the data structural organization through topologically constrained network representations.
We test our model on 18 benchmark datasets against 5 classic machine learning and 3 deep learning models.
arXiv Detail & Related papers (2023-08-26T08:48:51Z)
- Hindsight States: Blending Sim and Real Task Elements for Efficient Reinforcement Learning [61.3506230781327]
In robotics, one approach to generate training data builds on simulations based on dynamics models derived from first principles.
Here, we leverage the imbalance in complexity of the dynamics to learn more sample-efficiently.
We validate our method on several challenging simulated tasks and demonstrate that it improves learning both alone and when combined with an existing hindsight algorithm.
arXiv Detail & Related papers (2023-03-03T21:55:04Z)
- Curriculum Learning: A Survey [65.31516318260759]
Curriculum learning strategies have been successfully employed in all areas of machine learning.
We construct a taxonomy of curriculum learning approaches by hand, considering various classification criteria.
We build a hierarchical tree of curriculum learning methods using an agglomerative clustering algorithm.
arXiv Detail & Related papers (2021-01-25T20:08:32Z)
- Curriculum Learning with Diversity for Supervised Computer Vision Tasks [1.5229257192293197]
We introduce a novel curriculum sampling strategy which takes into consideration the diversity of the training data together with the difficulty of the inputs.
We prove that our strategy is very efficient for unbalanced data sets, leading to faster convergence and more accurate results.
arXiv Detail & Related papers (2020-09-22T15:32:49Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework well preserves the relations between samples.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.