Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks
- URL: http://arxiv.org/abs/2510.23208v1
- Date: Mon, 27 Oct 2025 10:54:25 GMT
- Title: Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks
- Authors: Amal Abed, Ivan Lukic, Jörg K. H. Franke, Frank Hutter,
- Abstract summary: Large language models (LLMs) have shown impressive promise in code generation.<n>We present a scalable synthetic data generation pipeline that produces nearly 800k instruction-reasoning-code-test quadruplets.
- Score: 41.75017840131367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources pair problems with solutions, but omit the intermediate thought process that guides coding. To close this gap, we present a scalable synthetic data generation pipeline that produces nearly 800k instruction-reasoning-code-test quadruplets. Each sample combines a task, a step-by-step reasoning trace, a working solution, and executable tests, enabling models to learn not just the what but also the how of problem solving. Our pipeline combines four key components: curated contest problems, web-mined content filtered by relevance classifiers, data expansion guided by reasoning patterns, and multi-stage execution-based validation. A genetic mutation algorithm further increases task diversity while maintaining consistency between reasoning traces and code implementations. Our key finding is that fine-tuning LLMs on this dataset yields consistent improvements on coding benchmarks. Beyond raw accuracy, reasoning-aware data can substitute for model scaling, generalize across architectures, and outperform leading open-source alternatives under identical sample budgets. Our work establishes reasoning-centered synthetic data generation as an efficient approach for advancing coding capabilities in LLMs. We publish our dataset and generation pipeline to facilitate further research.
Related papers
- X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests [47.271827881215295]
We propose a fully synthetic approach to training Code LLMs with entirely generated tasks, solutions, and test cases.<n>To support this, we leverage feature-based synthesis to propose a novel data synthesis pipeline called SynthSmith.<n> SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests.<n>X-Coder model series achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder-14B-Preview and AReal-boba2-14B.
arXiv Detail & Related papers (2026-01-11T15:22:33Z) - Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers [103.4410890572479]
We introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification.<n>LoongBench is a curated seed dataset containing 8,729 human-vetted examples across 12 domains.<n>LoongEnv is a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples.
arXiv Detail & Related papers (2025-09-03T06:42:40Z) - VERIRL: Boosting the LLM-based Verilog Code Generation via Reinforcement Learning [32.974199255760944]
We introduce a reinforcement learning framework tailored for Verilog code generation.<n>To tackle the problem of sparse and noisy reward signals, we propose a Trace-back based Rescore mechanism.<n>To mitigate catastrophic forgetting and overfitting during RL fine-tuning, we introduce a sample-balanced weighting strategy.
arXiv Detail & Related papers (2025-08-25T20:20:44Z) - Synthesis by Design: Controlled Data Generation via Structural Guidance [7.938713951512933]
We propose to extract structural information with generated problem-solving code from mathematical reasoning.<n>Our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty.<n>Results on our benchmark show that model performance declines as reasoning length increases.
arXiv Detail & Related papers (2025-06-09T11:38:23Z) - Synthetic Data Generation Using Large Language Models: Advances in Text and Code [0.0]
Large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains.<n>We highlight key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement.<n>We discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification.
arXiv Detail & Related papers (2025-03-18T08:34:03Z) - UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.<n>We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.<n>Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z) - SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories.<n> evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation.<n>We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z) - OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [76.59316249991657]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.<n>While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.<n>We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.<n>Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement.<n>We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.<n>Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.