Go-UT-Bench: A Fine-Tuning Dataset for LLM-Based Unit Test Generation in Go
- URL: http://arxiv.org/abs/2511.10868v1
- Date: Fri, 14 Nov 2025 00:35:00 GMT
- Title: Go-UT-Bench: A Fine-Tuning Dataset for LLM-Based Unit Test Generation in Go
- Authors: Yashshi Pipalani, Hritik Raj, Rajat Ghosh, Vaishnavi Bhargava, Debojyoti Dutta
- Abstract summary: Go-UT-Bench is a benchmark dataset of 5264 pairs of code and unit tests, drawn from 10 permissively licensed Golang repositories. Our results show that fine-tuned models outperform their base counterparts on more than 75% of benchmark tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training data imbalance poses a major challenge for code LLMs. Most available data heavily overrepresents raw open-source code while underrepresenting broader software engineering tasks, especially in low-resource languages like Golang. As a result, models excel at code autocompletion but struggle with real-world developer workflows such as unit test generation. To address this gap, we introduce Go-UT-Bench, a benchmark dataset of 5264 pairs of code and unit tests, drawn from 10 permissively licensed Golang repositories spanning diverse domains. We evaluate its effectiveness as a fine-tuning dataset across two LLM families: mixture-of-experts and dense decoders. Our results show that fine-tuned models outperform their base counterparts on more than 75% of benchmark tasks.
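To make the dataset's unit concrete, here is a minimal, hypothetical sketch of what one code/unit-test pair might look like. The `ParseSemVer` function, the `semver` package, and the accompanying table-driven test are invented for this summary; they are not samples from the actual benchmark.

```go
// pair.go - the "code" half of a hypothetical Go-UT-Bench pair.
package semver

import (
	"fmt"
	"strconv"
	"strings"
)

// Version is a minimal semantic version.
type Version struct {
	Major, Minor, Patch int
}

// ParseSemVer parses a "major.minor.patch" string into a Version.
func ParseSemVer(s string) (Version, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return Version{}, fmt.Errorf("want 3 components, got %d", len(parts))
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return Version{}, fmt.Errorf("invalid component %q", p)
		}
		nums[i] = n
	}
	return Version{nums[0], nums[1], nums[2]}, nil
}
```

The matching "unit test" half is what a fine-tuned model would be asked to produce, shown here in the table-driven style that is idiomatic in Go repositories:

```go
// pair_test.go - the "unit test" half of the same hypothetical pair.
package semver

import "testing"

func TestParseSemVer(t *testing.T) {
	tests := []struct {
		in      string
		want    Version
		wantErr bool
	}{
		{"1.2.3", Version{1, 2, 3}, false},
		{"0.0.0", Version{0, 0, 0}, false},
		{"1.2", Version{}, true},   // too few components
		{"1.2.x", Version{}, true}, // non-numeric component
	}
	for _, tc := range tests {
		got, err := ParseSemVer(tc.in)
		if (err != nil) != tc.wantErr {
			t.Errorf("ParseSemVer(%q) error = %v, wantErr %v", tc.in, err, tc.wantErr)
			continue
		}
		if !tc.wantErr && got != tc.want {
			t.Errorf("ParseSemVer(%q) = %+v, want %+v", tc.in, got, tc.want)
		}
	}
}
```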
Related papers
- Improving Code Generation via Small Language Model-as-a-judge [14.067404766521607]
We train several state-of-the-art SLMs as code correctness judges and assess their ability to discriminate between correct and wrong implementations. We show that modern SLMs outperform RankEF, even without exploiting execution-based information.
arXiv Detail & Related papers (2026-02-12T13:07:36Z)
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
arXiv Detail & Related papers (2025-04-05T02:52:16Z)
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process (see the sketch after this list). Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
- SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Large Language Models (LLMs) are trained on extensive datasets that include code repositories. Evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation. We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z)
- CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering [11.087034068992653]
We introduce CodeRepoQA, a large-scale benchmark for evaluating repository-level question-answering capabilities in software engineering. CodeRepoQA encompasses five programming languages and covers a wide range of scenarios, enabling comprehensive evaluation of language models. In total, CodeRepoQA is a multi-turn question-answering benchmark with 585,687 entries, covering a diverse array of software engineering scenarios.
arXiv Detail & Related papers (2024-12-19T11:48:01Z)
- Evaluation of Code LLMs on Geospatial Code Generation [1.6834474847800562]
Large Language Models (LLMs) can generate Python code for data science and machine learning applications. Here, we show how we constructed an evaluation benchmark for code generation models, based on a selection of geospatial tasks. Our dataset will hopefully contribute to the development of new models capable of solving geospatial coding tasks with high accuracy.
arXiv Detail & Related papers (2024-10-06T20:34:03Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
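As a rough illustration of the unit-test-guided synthesis loop described in the UnitCoder entry above, the sketch below keeps a model-generated candidate only if its paired tests pass when run in a throwaway module. The `candidate` type, the stubbed candidate list, and the sandbox layout are assumptions made for this summary, not the paper's actual pipeline.

```go
// filter.go - hypothetical sketch of unit-test-guided filtering:
// a generated (code, test) pair survives only if `go test` passes.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// candidate pairs a generated implementation with its generated test file.
type candidate struct {
	code, test string
}

// passes writes the pair into a temporary module and runs its tests.
func passes(c candidate) bool {
	dir, err := os.MkdirTemp("", "unitcoder-*")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)

	files := map[string]string{
		"go.mod":       "module sandbox\n\ngo 1.21\n",
		"impl.go":      c.code,
		"impl_test.go": c.test,
	}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(body), 0o600); err != nil {
			return false
		}
	}
	cmd := exec.Command("go", "test", "./...")
	cmd.Dir = dir
	return cmd.Run() == nil // exit code 0 means every test passed
}

func main() {
	// In the real pipeline these would come from an LLM; here the list is a stub.
	cands := []candidate{}
	kept := 0
	for _, c := range cands {
		if passes(c) {
			kept++ // only validated pairs would enter the training set
		}
	}
	fmt.Printf("kept %d of %d candidates\n", kept, len(cands))
}
```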