Related papers: SynthCoder: A Synthetical Strategy to Tune LLMs for Code Completion

SynthCoder: A Synthetical Strategy to Tune LLMs for Code Completion

URL: http://arxiv.org/abs/2508.15495v3
Date: Wed, 17 Sep 2025 06:15:46 GMT
Title: SynthCoder: A Synthetical Strategy to Tune LLMs for Code Completion
Authors: Dongjun Yu, Xiao Yan, Zhenrui Li, Jipeng Xiao, Haochuan He, Yongda Yu, Hao Zhang, Guoping Rong, Xiaobo Huang,
Abstract summary: Code completion is a prominent application of Large Language Models (LLMs) in software engineering.<n>This paper proposes SynthCoder, a model that integrates leading industry practices to achieve state-of-the-art on the Fill-in-the-Middle (FIM) code completion task.
Score: 7.668823606571788
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code completion is a prominent application of Large Language Models (LLMs) in software engineering. Due to the near real-time response requirements of this task, base models with small to medium-sized parameters are typically employed, supplemented by various optimization and post-training techniques. However, these optimization methods often have trade-offs, leading to a seesaw effect where performance improvements on certain datasets or metrics are accompanied by degradations on others -- sometimes even falling below the baseline model's performance. This paper proposes SynthCoder, a model that integrates leading industry practices to achieve state-of-the-art performance on the Fill-in-the-Middle (FIM) code completion task. In specific, we first construct a diverse dataset by combining Abstract Syntax Tree (AST) node extraction with heuristics that simulate developer behavior. Then we enrich our training corpus with cross-file contextual information using the BM25 algorithm and call graphs, enhancing the model's ability to perform code completion in both file-level and repository-level scenarios. As the last step, we employ a two-stage training process using the Seed-Coder-8B-Base as the base model. First, we fine-tune the model using Curriculum Learning technology. Following this, we perform alignment using Direct Preference Optimization (DPO) with preference pairs generated through Rejection Sampling. Experimental results demonstrate that our final model excels on mainstream repository-level code completion benchmarks, including aiXcoder, ExecRepoBench, CrossCodeEval, and CoLT. Furthermore, our carefully curated training set effectively mitigates the model's tendency to just repeat existing code, a common issue existing in various code completion models.

Related papers

Bagging-Based Model Merging for Robust General Text Embeddings [73.51674133699196]
General-purpose text embedding models underpin a wide range of NLP and information retrieval applications.<n>We present a systematic study of multi-task training for text embeddings from two perspectives: data scheduling and model merging.<n>We propose Bagging-based rObust mOdel Merging (BOOM), which trains multiple embedding models on sampled subsets and merges them into a single model.
arXiv Detail & Related papers (2026-02-05T15:45:08Z)
MPRU: Modular Projection-Redistribution Unlearning as Output Filter for Classification Pipelines [23.370444162993707]
We propose an emphinductive approach to machine unlearning (MU)<n>Unlearning can be done by reversing the last training sequence. This is implemented by appending a projection-redistribution layer in the end of the model.<n>Experiment results show consistently similar output to a fully retrained model with a high computational cost reduction.
arXiv Detail & Related papers (2025-10-30T08:09:37Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance.<n>This paper addresses the question of how to optimally combine the model's predictions and the provided labels.<n>Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
Towards A Generalist Code Embedding Model Based On Massive Data Synthesis [35.04242699869519]
We introduce textbfCodeR (underlineCode underlineRetrieval), a state-of-the-art embedding model for general-purpose code retrieval.<n>The superior performance of CodeR is built upon CodeR-Pile, a large-scale synthetic dataset constructed under the DRU principle.
arXiv Detail & Related papers (2025-05-19T04:37:53Z)
UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.<n>We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.<n>Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models [38.41524186248607]
We introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets.<n>For model architecture, we propose a latent attention layer to obtain pooled embeddings.<n>For training algorithm, we introduce a two-stage contrastive instruction-tuning method.
arXiv Detail & Related papers (2024-05-27T17:59:45Z)
Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost. We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning. We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
arXiv Detail & Related papers (2023-04-11T13:34:13Z)
DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models. We propose to generate a symbolic and ordered sequence from the relation matrix which is deterministic and easier for model to learn. Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning. During inference, we introduce a new generation procedure with a critical sampling strategy. For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
Assemble Foundation Models for Automatic Code Summarization [9.53949558569201]
We propose a flexible and robust approach for automatic code summarization based on neural networks. We assemble available foundation models, such as CodeBERT and GPT-2, into a single model named AdaMo. We introduce two adaptive schemes from the perspective of knowledge transfer, namely continuous pretraining and intermediate finetuning.
arXiv Detail & Related papers (2022-01-13T21:38:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.