MigrationBench: Repository-Level Code Migration Benchmark from Java 8
- URL: http://arxiv.org/abs/2505.09569v2
- Date: Mon, 19 May 2025 16:10:21 GMT
- Title: MigrationBench: Repository-Level Code Migration Benchmark from Java 8
- Authors: Linbo Liu, Xinle Liu, Qiang Zhou, Lin Chen, Yihan Liu, Hoan Nguyen, Behrooz Omidvar-Tehrani, Xi Shen, Jun Huan, Omer Tripp, Anoop Deoras
- Abstract summary: MigrationBench is a comprehensive benchmark for migration from Java $8$ to the latest long-term support (LTS) versions (Java $17$, $21$). We provide a comprehensive evaluation framework to facilitate rigorous and standardized assessment of large language models (LLMs) on this challenging task. For the selected subset with Claude-3.5-Sonnet-v2, SD-Feedback achieves $62.33\%$ and $27.33\%$ success rates (pass@1) for minimal and maximal migration respectively.
- Score: 18.648973521771396
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed with LLMs, significantly enhancing productivity and scalability. Numerous benchmark datasets have been developed to evaluate the coding capabilities of these models, but they focus primarily on code generation and issue-resolution tasks. In contrast, we introduce MigrationBench, a new coding benchmark with a distinct focus: code migration. MigrationBench aims to serve as a comprehensive benchmark for migration from Java $8$ to the latest long-term support (LTS) versions (Java $17$, $21$); it comprises a full dataset of $5,102$ repositories and a selected subset of $300$ repositories. The selected subset is representative, curated for complexity and difficulty, and offers a versatile resource to support research in the field of code migration. Additionally, we provide a comprehensive evaluation framework to facilitate rigorous and standardized assessment of LLMs on this challenging task. We further propose SD-Feedback and demonstrate that LLMs can effectively tackle repository-level code migration to Java $17$. On the selected subset with Claude-3.5-Sonnet-v2, SD-Feedback achieves $62.33\%$ and $27.33\%$ success rates (pass@1) for minimal and maximal migration respectively. The benchmark dataset and source code are available at: https://huggingface.co/collections/AmazonScience/migrationbench-68125452fc21a4564b92b6c3 and https://github.com/amazon-science/MigrationBench respectively.
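For intuition about the two migration modes, here is a minimal, hypothetical Java sketch (not drawn from the benchmark itself): minimal migration only requires the repository to build and pass tests under JDK 17, while maximal migration also adopts newer language and library features such as `List.of` (Java 9), `var` (Java 10), and switch expressions (Java 14).

```java
// Hypothetical before/after pair illustrating minimal vs. maximal migration;
// not an actual MigrationBench repository.
import java.util.Arrays;
import java.util.List;

public class MigrationModes {

    // Java 8 style: a minimal migration could leave this method untouched,
    // since it still compiles and runs on JDK 17.
    static String labelJava8(int code) {
        List<String> labels = Arrays.asList("ok", "warn", "error");
        String result;
        switch (code) {
            case 0:  result = labels.get(0); break;
            case 1:  result = labels.get(1); break;
            default: result = labels.get(2); break;
        }
        return result;
    }

    // Maximal migration: same behavior, rewritten with post-Java-8 features.
    static String labelJava17(int code) {
        var labels = List.of("ok", "warn", "error"); // immutable list, Java 9+
        return switch (code) {                       // switch expression, Java 14+
            case 0 -> labels.get(0);
            case 1 -> labels.get(1);
            default -> labels.get(2);
        };
    }

    public static void main(String[] args) {
        System.out.println(labelJava8(1));  // warn
        System.out.println(labelJava17(1)); // warn
    }
}
```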
Related papers
- CODEMENV: Benchmarking Large Language Models on Code Migration [11.735053997817765]
CODEMENV consists of 922 examples spanning 19 Python and Java packages. It covers three core tasks: identifying functions incompatible with specific versions, detecting changes in function definitions, and adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4o achieving the highest score at 43.84%.
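As a hedged illustration of the third task (adapting code to a target environment), consider the hypothetical Java case below; CODEMENV's own examples span 19 Python and Java packages and are not reproduced here.

```java
// Hypothetical version-adaptation example: the Integer(String) constructor
// has been deprecated since Java 9, so code targeting newer JDKs should
// switch to the factory method.
public class VersionAdaptation {

    // Before: fine on Java 8, emits deprecation warnings on Java 9+.
    @SuppressWarnings("deprecation")
    static Integer parseLegacy(String s) {
        return new Integer(s);
    }

    // After: Integer.valueOf uses the internal cache for small values
    // and avoids the deprecated constructor.
    static Integer parseAdapted(String s) {
        return Integer.valueOf(s);
    }

    public static void main(String[] args) {
        System.out.println(parseLegacy("42").equals(parseAdapted("42"))); // true
    }
}
```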
arXiv Detail & Related papers (2025-06-01T08:29:59Z)
- LLM-KG-Bench 3.0: A Compass for Semantic Technology Capabilities in the Ocean of LLMs [0.12564343689544843]
Current Large Language Models (LLMs) can assist with developing program code, among many other things, but can they also support working with Knowledge Graphs (KGs)? The LLM-KG-Bench framework in Version 3.0 is designed to answer this question. It consists of a set of tasks for automated evaluation of LLM answers and covers different aspects of working with semantic technologies.
arXiv Detail & Related papers (2025-05-19T13:29:27Z)
- Using LLMs for Library Migration [1.9247157750972368]
Large Language Models (LLMs) are good at generating and transforming code and finding similar code. We evaluate three LLMs, Llama 3.1, GPT-4o mini, and GPT-4o, on PyMigBench, performing 321 real-world library migrations. Llama 3.1, GPT-4o mini, and GPT-4o correctly migrate 89%, 89%, and 94% of the migration-related code changes respectively.
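PyMigBench's migrations are between Python library pairs; as a hypothetical Java analog of a "migration-related code change", the sketch below moves from the legacy java.util.Date API to java.time.

```java
// Hypothetical analog of a library migration; PyMigBench's Python
// migrations are not reproduced here.
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class LibraryMigrationExample {

    // Before: legacy, mutable, locale-sensitive java.util API.
    static Date parseLegacy(String s) throws ParseException {
        return new SimpleDateFormat("yyyy-MM-dd").parse(s);
    }

    // After: immutable, thread-safe java.time API (Java 8+).
    static LocalDate parseMigrated(String s) {
        return LocalDate.parse(s, DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parseLegacy("2024-01-31"));
        System.out.println(parseMigrated("2024-01-31")); // 2024-01-31
    }
}
```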
arXiv Detail & Related papers (2025-04-17T18:32:48Z)
- Teamwork makes the dream work: LLMs-Based Agents for GitHub README.MD Summarization [7.330697128881243]
We propose Metagente as a novel approach to amplify the synergy of various Large Language Models (LLMs). Metagente is a multi-agent framework based on a series of LLMs that self-optimizes the system through evaluation, feedback, and cooperation among specialized agents. The performance gain compared to GitSum, the most relevant benchmark, ranges from 27.63% to 60.43%.
arXiv Detail & Related papers (2025-03-13T20:42:39Z)
- SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving competitive performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z)
- ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation [19.69195067838796]
We construct a class-level code translation benchmark, ClassEval-T, and make the first attempt to extensively assess recent LLMs' performance on class-level code translation. It cost us 360 person-hours to accomplish the manual migration to Java and C++ with complete code samples and associated test suites. Experimental results demonstrate a remarkable performance drop compared with the most widely studied method-level code translation benchmark.
arXiv Detail & Related papers (2024-11-09T11:13:14Z)
- Evaluation of Code LLMs on Geospatial Code Generation [1.6834474847800562]
Large Language Models (LLMs) can generate Python code for data science and machine learning applications. Here, we show how we constructed an evaluation benchmark for code generation models based on a selection of geospatial tasks. Our dataset will hopefully contribute to the development of new models capable of solving geospatial coding tasks with high accuracy.
arXiv Detail & Related papers (2024-10-06T20:34:03Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM); the latter is sketched below.
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
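As a hedged sketch of what version-aware code migration (VACM) looks like in practice, here is a hypothetical JUnit 4 to JUnit 5 change in Java; VersiCode's actual task instances are not reproduced here.

```java
// Hypothetical VACM-style change: migrating a test from JUnit 4 to JUnit 5.
//
// JUnit 4 (before):
//   import org.junit.Test;
//   import static org.junit.Assert.assertEquals;
//
//   public class CalculatorTest {
//       @Test
//       public void addsTwoNumbers() { assertEquals(4, 2 + 2); }
//   }

// JUnit 5 (after): annotations and assertions live in the Jupiter packages,
// and test methods no longer need to be public.
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

public class CalculatorTest {
    @Test
    void addsTwoNumbers() {
        assertEquals(4, 2 + 2);
    }
}
```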
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
To target those weaknesses, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we use GPT-4 to generate high-quality data for each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.