Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects
- URL: http://arxiv.org/abs/2507.19271v1
- Date: Fri, 25 Jul 2025 13:49:24 GMT
- Title: Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects
- Authors: Igli Begolli, Meltem Aksoy, Daniel Neider
- Abstract summary: This study presents an empirical evaluation of the effect of monolingual fine-tuning on the performance of open-source language models (LMs). We fine-tuned three distinct models, CodeReviewer, CodeLlama-7B, and DeepSeek-R1-Distill, on a C#-specific dataset combining public benchmarks with industrial repositories. Our results show that monolingual fine-tuning improves model accuracy and relevance compared to multilingual baselines.
- Score: 4.3012765978447565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code review is essential for maintaining software quality but is often time-consuming and cognitively demanding, especially in industrial environments. Recent advancements in language models (LMs) have opened new avenues for automating core review tasks. This study presents an empirical evaluation of the effect of monolingual fine-tuning on the performance of open-source LMs across three key automated code review tasks: Code Change Quality Estimation, Review Comment Generation, and Code Refinement. We fine-tuned three distinct models, CodeReviewer, CodeLlama-7B, and DeepSeek-R1-Distill, on a C#-specific dataset combining public benchmarks with industrial repositories. Our study investigates how different configurations of programming languages and natural languages in the training data affect LM performance, particularly in comment generation. Additionally, we benchmark the fine-tuned models against an automated software analysis tool (ASAT) and human reviewers to evaluate their practical utility in real-world settings. Our results show that monolingual fine-tuning improves model accuracy and relevance compared to multilingual baselines. While LMs can effectively support code review workflows, especially for routine or repetitive tasks, human reviewers remain superior in handling semantically complex or context-sensitive changes. Our findings highlight the importance of language alignment and task-specific adaptation in optimizing LMs for automated code review.
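To make the setup concrete, the following is a minimal sketch (not the authors' released code) of monolingual LoRA fine-tuning of CodeLlama-7B for review comment generation on a C#-only dataset; the dataset file name, JSONL fields, prompt template, and hyperparameters are illustrative assumptions.
```python
# Hedged sketch: adapter-based (LoRA) fine-tuning of CodeLlama-7B on a hypothetical
# C#-only review dataset with fields "diff" (code change) and "comment" (reviewer comment).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

MODEL = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
# Parameter-efficient adaptation: train low-rank adapters instead of all 7B weights.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

# Hypothetical JSONL file; the paper's actual data mix is public benchmarks + industrial repos.
data = load_dataset("json", data_files="csharp_review_train.jsonl", split="train")

def to_features(example):
    # Simple diff -> comment prompt; the paper's exact prompt template is not reproduced here.
    text = (f"### Code change (C#):\n{example['diff']}\n"
            f"### Review comment:\n{example['comment']}")
    return tokenizer(text, truncation=True, max_length=1024)

data = data.map(to_features, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codellama-csharp-review", num_train_epochs=3,
                           per_device_train_batch_size=1, gradient_accumulation_steps=16,
                           learning_rate=2e-4, bf16=True, logging_steps=50),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
The same adapter-based recipe would apply to the other two models; the benchmarking against the ASAT and human reviewers described in the abstract is outside the scope of this sketch.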
Related papers
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z) - LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages [0.1906498126334485]
This paper evaluates the capabilities of the Llama 2-70B model in automating scientific applications written in programming languages. We assess the model's capacity to generate code, documentation, and unit tests, as well as its ability to translate existing code between programming languages. Our results indicate that while Llama 2-70B frequently generates syntactically correct and functional code for simpler numerical tasks, it encounters substantial difficulties with more complex, parallelized, or distributed computations.
arXiv Detail & Related papers (2025-03-24T23:46:14Z) - Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation [5.6001617185032595]
Large language models pretrained on both programming and natural language data tend to perform well in code-oriented tasks.
We fine-tune open-source large language models (LLMs) in a parameter-efficient, quantized low-rank fashion on consumer-grade hardware to improve review comment generation; a minimal configuration sketch of this kind of setup appears after this list.
arXiv Detail & Related papers (2024-11-15T12:01:38Z) - Training of Scaffolded Language Models with Language Supervision: A Survey [62.59629932720519]
This survey organizes the literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools.
arXiv Detail & Related papers (2024-10-21T18:06:25Z) - From Effectiveness to Efficiency: Uncovering Linguistic Bias in Large Language Model-based Code Generation [30.914387085368734]
Large Language Models (LLMs) have demonstrated promising capabilities for code generation. In this paper, we aim to investigate the potential linguistic bias through the lens of English and Chinese.
arXiv Detail & Related papers (2024-06-02T03:22:30Z) - Automating Patch Set Generation from Code Review Comments Using Large Language Models [2.045040820541428]
We provide code contexts to five popular Large Language Models (LLMs) and obtain the suggested code changes (patch sets) derived from real-world code review comments.
The performance of each model is assessed by comparing its generated patch sets against the historical data of human-generated patch sets.
arXiv Detail & Related papers (2024-04-10T02:46:08Z) - Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
arXiv Detail & Related papers (2024-02-20T13:56:38Z) - CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
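The parameter-efficient, quantized low-rank fine-tuning mentioned in the comment generation paper above (2024-11-15) roughly corresponds to a QLoRA-style configuration. The following is a hedged sketch of such a setup; the base model choice and hyperparameters are illustrative assumptions, not taken from that paper.
```python
# Hedged sketch of a quantized low-rank (QLoRA-style) fine-tuning configuration
# suitable for consumer-grade GPUs: 4-bit base weights plus trainable LoRA adapters.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

BASE = "codellama/CodeLlama-7b-hf"  # any open-source code LLM; illustrative choice
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)   # freeze quantized base, enable input gradients
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))
model.print_trainable_parameters()               # only the low-rank adapters are trained
```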