Escalating LLM-based Code Translation Benchmarking into the Class-level Era
- URL: http://arxiv.org/abs/2411.06145v2
- Date: Tue, 19 Nov 2024 07:19:27 GMT
- Title: Escalating LLM-based Code Translation Benchmarking into the Class-level Era
- Authors: Pengyu Xue, Linhao Wu, Chengyi Wang, Xiang Li, Zhen Yang, Ruikai Jin, Yuxiang Zhang, Jia Li, Yifei Pei, Zhaoyan Shen, Xiran Lyu,
- Abstract summary: ClassEval-T is a class-level code translation benchmark for Large Language Models (LLMs)
Built upon ClassEval, ClassEval-T extends into Java and C++ with complete code samples and test suites, requiring 360 person-hours for manual migration.
- Score: 20.22104136730419
- License:
- Abstract: In recent years, Large Language Models (LLMs) have significantly improved automated code translation, often achieving over 80% accuracy on existing benchmarks. However, most of these benchmarks consist of short, standalone, algorithmic samples that do not reflect practical coding tasks. To address this gap, we introduce ClassEval-T, a class-level code translation benchmark designed to assess LLM performance on real-world coding scenarios. Built upon ClassEval, a class-level Python code generation benchmark covering topics such as database operations and game design, ClassEval-T extends into Java and C++ with complete code samples and test suites, requiring 360 person-hours for manual migration. We propose three translation strategies (holistic, min-dependency, and standalone) and evaluate six recent LLMs across various families and sizes on ClassEval-T. Results reveal a significant performance drop compared to method-level benchmarks, highlighting discrepancies among LLMs and demonstrating ClassEval-T's effectiveness. We further analyze LLMs' dependency awareness in translating class samples and categorize 1,397 failure cases by the best-performing LLM for practical insights and future improvement.
Related papers
- Stacking Small Language Models for Generalizability [0.0]
Large language models (LLMs) generalize strong performance across different natural language benchmarks.
This paper introduces a new approach called fine-tuning stacks of language models (FSLM)
By fine-tuning each SLM to perform a specific task, this approach breaks down high level reasoning into multiple lower-level steps that specific SLMs are responsible for.
As a result, FSLM allows for lower training and inference costs, and also improves model interpretability as each SLM communicates with the subsequent one through natural language.
arXiv Detail & Related papers (2024-10-21T01:27:29Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond [24.151927600694066]
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs.
This paper conducts the first comprehensive experiment to investigate how far we have been in applying Large Language Models (LLMs) to generate high-quality commit messages.
arXiv Detail & Related papers (2024-04-23T08:24:43Z) - Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM [13.324171480106715]
EvoEval is a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains.
Our study shows that compared to the high performance obtained on standard benchmarks like HumanEval, there is a significant drop in performance.
We showcase various insights, including the brittleness of instruction-following models when encountering rewording or subtle changes.
arXiv Detail & Related papers (2024-03-28T03:10:39Z) - State of What Art? A Call for Multi-Prompt LLM Evaluation [28.307860675006545]
We comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances.
To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead.
arXiv Detail & Related papers (2023-12-31T22:21:36Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - An Examination of the Compositionality of Large Generative Vision-Language Models [7.639748270719836]
Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning.
In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs.
We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs.
arXiv Detail & Related papers (2023-08-21T06:50:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.