On the Evaluation of Neural Code Translation: Taxonomy and Benchmark
- URL: http://arxiv.org/abs/2308.08961v1
- Date: Thu, 17 Aug 2023 13:05:27 GMT
- Title: On the Evaluation of Neural Code Translation: Taxonomy and Benchmark
- Authors: Mingsheng Jiao, Tingrui Yu, Xuan Li, Guanjie Qiu, Xiaodong Gu, Beijun
Shen
- Score: 12.431884660186281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, neural code translation has gained increasing attention.
While most research focuses on improving model architectures and
training processes, we observe that the evaluation processes and benchmarks for
code translation models are severely limited: they primarily treat source code
as natural language and report a single holistic accuracy score, disregarding
the full spectrum of model capabilities across translation types and levels of
complexity. In this paper, we present a comprehensive investigation of four
state-of-the-art models and analyze in depth the advantages and limitations of
three existing benchmarks. Based on the empirical results, we develop a
taxonomy that categorizes code translation tasks into four primary types
according to their complexity and knowledge dependence: token level (type 1),
syntactic level (type 2), library level (type 3), and algorithm level (type 4).
We then conduct a thorough analysis of how existing approaches perform across
these four categories. Our findings indicate that while state-of-the-art code
translation models excel in type-1 and type-2 translations, they struggle with
knowledge-dependent ones such as type-3 and type-4. Existing benchmarks are
biased towards trivial translations, such as keyword mapping. To overcome these
limitations, we construct G-TransEval, a new benchmark built by manually
curating type-3 and type-4 translation pairs together with unit test cases.
Results on the new benchmark suggest that G-TransEval reveals the capabilities
of code translation models more comprehensively and at a finer granularity,
and thus provides a more rigorous evaluation. Our study also yields insightful
findings and suggestions for future research, such as constructing type-3 and
type-4 training data and ensembling multiple pretraining approaches.
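As a purely illustrative sketch of the taxonomy described above, the four translation types can be pictured as Python-to-Java snippet pairs. These example pairs are hypothetical illustrations of each category, not drawn from G-TransEval itself:

```python
# Hypothetical examples of the paper's four translation types, shown as
# Python -> Java snippet pairs. These are our own illustrations, not
# actual G-TransEval benchmark entries.
TRANSLATION_TYPES = {
    "type-1 (token level)": {
        # Simple keyword/token mapping between languages.
        "python": "x = None",
        "java": "Object x = null;",
    },
    "type-2 (syntactic level)": {
        # Restructuring a language-specific construct (list comprehension).
        "python": "squares = [i * i for i in range(10)]",
        "java": "int[] squares = IntStream.range(0, 10)"
                ".map(i -> i * i).toArray();",
    },
    "type-3 (library level)": {
        # Requires knowledge of equivalent third-party/stdlib APIs.
        "python": "data = json.loads(s)",
        "java": "JsonNode data = new ObjectMapper().readTree(s);",
    },
    "type-4 (algorithm level)": {
        # Requires reimplementing an idiom with a different algorithm/API.
        "python": "ys = sorted(xs, key=lambda p: p[1])",
        "java": "xs.sort(Comparator.comparingInt(p -> p.get(1)));",
    },
}

def knowledge_dependent(type_name: str) -> bool:
    """Per the paper's findings, type-3 and type-4 translations depend on
    external (library or algorithmic) knowledge, where current models
    struggle; type-1 and type-2 are largely surface-level rewrites."""
    return type_name.startswith(("type-3", "type-4"))
```

The helper `knowledge_dependent` simply encodes the paper's reported split: models excel at type-1/type-2 but struggle with type-3/type-4.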
Related papers
- Repository-level Code Translation Benchmark Targeting Rust [28.25765853736366]
We introduce the first repository-level code translation benchmark, comprising 375 tasks targeting Rust.
Using this benchmark, we study four state-of-the-art large language models (LLMs).
Our findings reveal that LLMs perform substantially worse on repository-level translations than on simpler tasks (a 41.5%-56.2% Pass@1 drop for GPT-4).
arXiv Detail & Related papers (2024-11-21T10:00:52Z) - Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels [20.05501751993599]
GPT-4 achieves performance comparable to junior-level translators in terms of total errors.
Unlike traditional Neural Machine Translation systems, GPT-4 maintains consistent translation quality across all evaluated language pairs.
arXiv Detail & Related papers (2024-11-21T01:12:46Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Multilingual and Multi-topical Benchmark of Fine-tuned Language models and Large Language Models for Check-Worthy Claim Detection [1.4779899760345434]
This study compares the performance of (1) fine-tuned language models and (2) large language models on the task of check-worthy claim detection.
We composed a multilingual and multi-topical dataset comprising texts of various sources and styles.
arXiv Detail & Related papers (2023-11-10T15:36:35Z) - On Search Strategies for Document-Level Neural Machine Translation [51.359400776242786]
Document-level neural machine translation (NMT) models produce a more consistent output across a document.
In this work, we aim to answer the question of how best to utilize a context-aware translation model in decoding.
arXiv Detail & Related papers (2023-06-08T11:30:43Z) - T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z) - Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z) - Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work proposes, to our knowledge, the first systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based models on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z) - Rethinking Document-level Neural Machine Translation [73.42052953710605]
We try to answer the question: Is the capacity of current models strong enough for document-level translation?
We observe that the original Transformer with appropriate training techniques can achieve strong results for document translation, even with a length of 2000 words.
arXiv Detail & Related papers (2020-10-18T11:18:29Z) - Adapting Deep Learning for Sentiment Classification of Code-Switched Informal Short Text [1.6752182911522517]
We present a labeled dataset called MultiSenti for sentiment classification of code-switched informal short text.
We propose a deep learning-based model for sentiment classification of code-switched informal short text.
arXiv Detail & Related papers (2020-01-04T06:31:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.