Related papers: ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation

ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation

URL: http://arxiv.org/abs/2411.06145v4
Date: Mon, 14 Apr 2025 08:45:07 GMT
Title: ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation
Authors: Pengyu Xue, Linhao Wu, Zhen Yang, Chengyi Wang, Xiang Li, Yuxiang Zhang, Jia Li, Ruikai Jin, Yifei Pei, Zhaoyan Shen, Xiran Lyu, Jacky Wai Keung,
Abstract summary: We construct a class-level code translation benchmark, ClassEval-T, and make the first attempt to extensively assess recent LLMs' performance on class-level code translation.<n>It cost us 360 person-hours to accomplish the manual migration to Java and C++ with complete code samples and associated test suites.<n> Experimental results demonstrate a remarkable performance drop compared with the most widely studied method-level code translation benchmark.
Score: 19.69195067838796
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, Large Language Models (LLMs) have dramatically advanced the performance of automated code translation, making their computational accuracy score reach up to over 80% on many previous benchmarks. However, most code samples in these benchmarks are short, standalone, statement/method-level, and algorithmic, which is not aligned with practical coding tasks. Therefore, it is still unknown the actual capability of LLMs in translating code samples written for daily development. To achieve this, we construct a class-level code translation benchmark, ClassEval-T, and make the first attempt to extensively assess recent LLMs' performance on class-level code translation. ClassEval-T is extended from ClassEval, a well-known class-level Python code generation benchmark consisting of multiple practical coding topics, such as database operation and game design, and diverse contextual dependencies (e.g., fields, methods, and libraries). It cost us 360 person-hours to accomplish the manual migration to Java and C++ with complete code samples and associated test suites. Subsequently, we design three translation strategies (i.e., holistic, min-dependency, and standalone) for class-level code translations and evaluate eight recent LLMs of commercial, general, and code kinds in diverse families and sizes on ClassEval-T. Experimental results demonstrate a remarkable performance drop compared with the most widely studied method-level code translation benchmark, and obvious discrepancies among LLMs appear, showing the effectiveness of ClassEval-T in measuring recent LLMs. Afterwards, we further discuss the usage scenarios for diverse translation strategies and LLMs' ability to dependency awareness when translating class samples. Finally, 1,243 failure cases made by the best-performing LLM under test are analyzed and categorized in this paper for practical guidance and future enlightenment.

Related papers

Text Classification in the LLM Era - Where do we stand? [2.7624021966289605]
Large Language Models revolutionized NLP and showed dramatic performance improvements across several tasks. We investigated the role of such language models in text classification and how they compare with other approaches.
arXiv Detail & Related papers (2025-02-17T14:25:54Z)
A Comprehensive Analysis on LLM-based Node Classification Algorithms [21.120619437937382]
We develop a comprehensive and testbed for node classification using Large Language Models (LLMs) It includes ten datasets, eight LLM-based algorithms, and three learning paradigms, and is designed for easy extension with new methods and datasets. We conduct extensive experiments, training and evaluating over 2,200 models, to determine the key settings that affect performance. Our findings uncover eight insights, e.g., LLM-based methods can significantly outperform traditional methods in a semi-supervised setting, while the advantage is marginal in a supervised setting.
arXiv Detail & Related papers (2025-02-02T15:56:05Z)
A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks [0.0]
We evaluate the performance of Poly-Coder, a pioneering open-source, multilingual CLM built for code generation. Our results suggest that the outcomes observed in these translated benchmarks align well with evaluation metrics used during the training phase. These initial insights highlight the need for more comprehensive empirical studies.
arXiv Detail & Related papers (2024-11-23T06:40:47Z)
Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities. The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? [4.616570111453259]
Large language models (LLMs) exhibit state-of-the-art performance in various tasks, but struggle for code translation. We conduct a large-scale empirical study to exploit the capabilities and incapabilities of LLMs in code translation tasks. We propose two methods: (1) intermediary translation which selects an intermediary language between the source and target ones; and (2) self-training which fine-tunes LLMs on self-generated parallel data.
arXiv Detail & Related papers (2024-10-13T12:20:12Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
CUDRT: Benchmarking the Detection Models of Human vs. Large Language Models Generated Texts [9.682499180341273]
Large language models (LLMs) have greatly enhanced text generation across industries. Their human-like outputs make distinguishing between human and AI authorship challenging. Current benchmarks mainly rely on static datasets, limiting their effectiveness in assessing model-based detectors.
arXiv Detail & Related papers (2024-06-13T12:43:40Z)
LLMEmbed: Rethinking Lightweight LLM's Genuine Function in Text Classification [13.319594321038926]
We propose a simple and effective transfer learning strategy, namely LLMEmbed, to address this classical but challenging task. We perform extensive experiments on publicly available datasets, and the results show that LLMEmbed achieves strong performance while enjoys low training overhead.
arXiv Detail & Related papers (2024-06-06T03:46:59Z)
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond [24.151927600694066]
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs. This paper conducts the first comprehensive experiment to investigate how far we have been in applying Large Language Models (LLMs) to generate high-quality commit messages.
arXiv Detail & Related papers (2024-04-23T08:24:43Z)
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM [13.324171480106715]
EvoEval is a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains. Our study shows that compared to the high performance obtained on standard benchmarks like HumanEval, there is a significant drop in performance. We showcase various insights, including the brittleness of instruction-following models when encountering rewording or subtle changes.
arXiv Detail & Related papers (2024-03-28T03:10:39Z)
Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
State of What Art? A Call for Multi-Prompt LLM Evaluation [28.307860675006545]
We comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances. To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead.
arXiv Detail & Related papers (2023-12-31T22:21:36Z)
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation. We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results. For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data. For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
An Examination of the Compositionality of Large Generative Vision-Language Models [7.639748270719836]
Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs. We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs.
arXiv Detail & Related papers (2023-08-21T06:50:29Z)
LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.