JMigBench: A Benchmark for Evaluating LLMs on Source Code Migration (Java 8 to Java 11)
- URL: http://arxiv.org/abs/2602.09930v1
- Date: Tue, 10 Feb 2026 16:04:00 GMT
- Title: JMigBench: A Benchmark for Evaluating LLMs on Source Code Migration (Java 8 to Java 11)
- Authors: Nishil Amin, Zhiwei Fei, Xiang Li, Justyna Petke, He Ye,
- Abstract summary: We build a benchmark to evaluate large language models (LLMs) for source code migration tasks.<n>We first collected a dataset of function pairs from open-source repositories, but limitations in data quality led us to construct a refined dataset.<n>Using this dataset, the Mistral Codestral model was evaluated with CodeBLEU and metrics to measure lexical and semantic similarity as well as migration correctness.
- Score: 9.302832064874357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We build a benchmark to evaluate large language models (LLMs) for source code migration tasks, specifically upgrading functions from Java 8 to Java 11. We first collected a dataset of function pairs from open-source repositories, but limitations in data quality led us to construct a refined dataset covering eight categories of deprecated APIs. Using this dataset, the Mistral Codestral model was evaluated with CodeBLEU and keyword-based metrics to measure lexical and semantic similarity as well as migration correctness. Results show that the evaluated model (Mistral Codestral) can handle trivial one-to-one API substitutions with moderate success, achieving identical migrations in 11.11% of the cases, but it struggles with more complex migrations such as CORBA or JAX-WS. These findings suggest Mistral Codestral can partially reduce developer effort by automating repetitive migration tasks but cannot yet replace humans within the scope of the JMigBench benchmark. The benchmark and analysis provide a foundation for future work on expanding datasets, refining prompting strategies, and improving migration performance across different LLMs.
Related papers
- SPELL: Synthesis of Programmatic Edits using LLMs [10.41623927140964]
Library migration is a common but error-prone task in software development.<n>We present a new approach to automated API migration that sidesteps the limitations described above.
arXiv Detail & Related papers (2026-02-01T09:03:56Z) - Automatic Qiskit Code Refactoring Using Large Language Models [39.71511919246829]
We present a novel methodology for Qiskit code using large language models (LLMs)<n>We begin by extracting a taxonomy of migration scenarios from the different sources of official Qiskit documentation.<n>This taxonomy, along with the original Python source code, is provided as input to an LLM, which is then tasked with identifying instances of migration scenarios in the code.
arXiv Detail & Related papers (2025-06-17T14:00:48Z) - CODEMENV: Benchmarking Large Language Models on Code Migration [11.735053997817765]
CODEMENV consists of 922 examples spanning 19 Python and Java packages.<n>It covers three core tasks: identifying functions incompatible with specific versions, detecting changes in function definitions, and adapting code to target environments.<n> Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4O achieving the highest score at 43.84%.
arXiv Detail & Related papers (2025-06-01T08:29:59Z) - An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform Many-to-many summarization (M2MS) in real applications.<n>This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z) - MigrationBench: Repository-Level Code Migration Benchmark from Java 8 [18.648973521771396]
MigrationBench is a comprehensive benchmark for migration from Java $8$ to the latest long-term support (LTS) versions (Java $17$, $21$)<n>We provide a comprehensive evaluation framework to facilitate rigorous and standardized assessment of large language models (LLMs) on this challenging task.<n>For the selected subset with Claude-3.5-Sonnet-v2, SD-Feedback achieves $62.33%$ and $27.33%$ success rate (pass@1) for minimal and maximal migration respectively.
arXiv Detail & Related papers (2025-05-14T17:11:23Z) - SweRank: Software Issue Localization with Code Ranking [109.3289316191729]
SweRank is an efficient retrieve-and-rerank framework for software issue localization.<n>We construct SweLoc, a large-scale dataset curated from public GitHub repositories.<n>We show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems.
arXiv Detail & Related papers (2025-05-07T19:44:09Z) - MigGPT: Harnessing Large Language Models for Automated Migration of Out-of-Tree Linux Kernel Patches Across Versions [53.811953357289866]
Large language models (LLMs) have shown remarkable progress across various domains.<n>LLMs struggle with incomplete code context understanding and inaccurate migration point identification.<n>MigGPT is a framework that employs a novel code fingerprint structure to retain code snippet information.
arXiv Detail & Related papers (2025-04-13T08:08:37Z) - Retrieval-augmented code completion for local projects using large language models [0.0]
We train two open transformer-based models, the generative GPT-2 and the retrieval-adapted RETRO, on open-source Python files.<n>We improve our models' performance with In-context retrieval-augmented generation (RAG), which retrieves code snippets using the Jaccard similarity of tokens.<n> Experimental results indicate that In-context RAG improves the code completion baseline by over 26%, while RETRO improves over the similarly sized GPT-2 baseline by 12%.
arXiv Detail & Related papers (2024-08-09T12:26:57Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.