English Please: Evaluating Machine Translation for Multilingual Bug Reports
- URL: http://arxiv.org/abs/2502.14338v2
- Date: Tue, 04 Mar 2025 23:24:09 GMT
- Title: English Please: Evaluating Machine Translation for Multilingual Bug Reports
- Authors: Avinash Patil, Aryan Jadon
- Abstract summary: This study is the first comprehensive evaluation of machine translation (MT) performance on bug reports. We employ multiple machine translation metrics, including BLEU, BERTScore, COMET, METEOR, and ROUGE. DeepL consistently outperforms the other systems, demonstrating strong lexical and semantic alignment.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate translation of bug reports is critical for efficient collaboration in global software development. In this study, we conduct the first comprehensive evaluation of machine translation (MT) performance on bug reports, analyzing the capabilities of DeepL, AWS Translate, and ChatGPT using data from the Visual Studio Code GitHub repository, specifically focusing on reports labeled with the english-please tag. To thoroughly assess the accuracy and effectiveness of each system, we employ multiple machine translation metrics, including BLEU, BERTScore, COMET, METEOR, and ROUGE. Our findings indicate that DeepL consistently outperforms the other systems across most automatic metrics, demonstrating strong lexical and semantic alignment. AWS Translate performs competitively, particularly in METEOR, while ChatGPT lags in key metrics. This study underscores the importance of domain adaptation for translating technical texts and offers guidance for integrating automated translation into bug-triaging workflows. Moreover, our results establish a foundation for future research to refine machine translation solutions for specialized engineering contexts. The code and dataset for this paper are available at GitHub: https://github.com/av9ash/gitbugs/tree/main/multilingual.
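The repository linked above contains the authors' actual evaluation code; as a rough orientation, the sketch below shows how three of the five metrics might be computed for a single translated bug report in Python, using the sacrebleu, nltk, and rouge_score packages. The two sentences are hypothetical examples, not data from the english-please set, and BERTScore and COMET follow the same per-segment pattern with their own libraries.

```python
# Minimal sketch: score one MT hypothesis against a human reference.
# The sentences are hypothetical examples, not data from the paper.
# pip install sacrebleu nltk rouge_score
import sacrebleu
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

reference = "The editor crashes when opening a large JSON file."
hypothesis = "The editor crashes while opening a big JSON file."  # MT output

# BLEU (sacrebleu's corpus-level API; wrap single segments in lists)
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
print(f"BLEU:    {bleu.score:.2f}")

# METEOR (NLTK expects pre-tokenized input and needs the wordnet corpus:
# run nltk.download("wordnet") once beforehand)
meteor = meteor_score([reference.split()], hypothesis.split())
print(f"METEOR:  {meteor:.3f}")

# ROUGE-L (note the (target, prediction) argument order)
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, hypothesis)["rougeL"].fmeasure
print(f"ROUGE-L: {rouge_l:.3f}")
```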
Related papers
- COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing
COMI-LINGUA is the largest manually annotated Hindi-English code-mixed dataset. It comprises 125K+ high-quality instances across five core NLP tasks. Each instance is annotated by three bilingual annotators, yielding over 376K expert annotations.
arXiv Detail & Related papers (2025-03-27T16:36:39Z)
- Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation
We introduce ADSQE, a novel framework for alleviating distribution shift in synthetic QE data.
ADSQE uses references, i.e., translation supervision signals, to guide both the generation and annotation processes.
Experiments demonstrate that ADSQE outperforms SOTA baselines like COMET in both supervised and unsupervised settings.
arXiv Detail & Related papers (2025-02-27T10:11:53Z)
- A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations
This study focuses on the case of English-Marathi language pairs, where existing datasets are notably noisy.
To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations.
Results demonstrate a significant improvement in translation quality over the baseline after filtering with IndicSBERT.
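Cross-lingual similarity filtering of this kind is straightforward to sketch. The snippet below is a hypothetical illustration that uses the sentence-transformers LaBSE model as a stand-in for IndicSBERT, with a made-up similarity threshold; it keeps only sentence pairs whose embeddings are close.

```python
# Hypothetical sketch of similarity-based parallel-data filtering.
# LaBSE stands in for IndicSBERT; the 0.8 threshold is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(pairs, threshold=0.8):
    """pairs: list of (english_sentence, marathi_sentence) tuples."""
    en = model.encode([p[0] for p in pairs], convert_to_tensor=True)
    mr = model.encode([p[1] for p in pairs], convert_to_tensor=True)
    sims = util.cos_sim(en, mr).diagonal()  # pairwise similarities
    return [p for p, s in zip(pairs, sims) if s.item() >= threshold]
```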
arXiv Detail & Related papers (2024-09-04T13:49:45Z)
- Evaluating Automatic Metrics with Incremental Machine Translation Systems
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
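This evaluation idea reduces to a pairwise preference test: if systems improve over time, a good metric should score the newer translation higher. A minimal hypothetical sketch, assuming a reference-based metric and (reference, older, newer) triples:

```python
# Hypothetical sketch: measure how often a metric prefers the more recent
# of two translations of the same source (assumes systems improve over time).
import sacrebleu

def preference_accuracy(triples):
    """triples: list of (reference, older_mt, newer_mt) strings."""
    wins = 0
    for ref, old, new in triples:
        old_score = sacrebleu.sentence_bleu(old, [ref]).score
        new_score = sacrebleu.sentence_bleu(new, [ref]).score
        wins += new_score > old_score
    return wins / len(triples)
```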
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- Automated Multi-Language to English Machine Translation Using Generative Pre-Trained Transformers
This study examines the use of local Generative Pre-trained Transformer (GPT) models to perform automated zero-shot, black-box, sentence-wise, multi-natural-language translation into English text.
We benchmark 16 different open-source GPT models from the Huggingface LLM repository, with no custom fine-tuning, on translating 50 different non-English languages into English.
The reported benchmark metrics are translation accuracy, measured with the BLEU, GLEU, METEOR, and chrF text-overlap measures, and wall-clock time per sentence translation.
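As a rough illustration of that benchmarking loop (the model name and prompt below are placeholders, not the paper's actual configuration):

```python
# Rough sketch of the benchmarking loop: zero-shot sentence translation with
# a local open-source model, scored with chrF and timed per sentence.
# Model name and prompt format are illustrative placeholders.
import time
import sacrebleu
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def translate_and_score(src_sentence, reference):
    prompt = f"Translate the following sentence to English: {src_sentence}\nEnglish:"
    start = time.perf_counter()
    out = generator(prompt, max_new_tokens=64, return_full_text=False)
    elapsed = time.perf_counter() - start
    hypothesis = out[0]["generated_text"].strip()
    chrf = sacrebleu.sentence_chrf(hypothesis, [reference]).score
    return hypothesis, chrf, elapsed
```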
arXiv Detail & Related papers (2024-04-23T02:19:35Z)
- Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation
This study investigates how Large Language Models (LLMs) leverage source and reference data in machine translation evaluation task.
We find that reference information significantly enhances evaluation accuracy, while, surprisingly, source information is sometimes counterproductive.
arXiv Detail & Related papers (2024-01-12T13:23:21Z)
- Leveraging Language Identification to Enhance Code-Mixed Text Classification
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- MuLER: Detailed and Scalable Reference-based Evaluation
We propose a novel methodology that transforms any reference-based evaluation metric for text generation into a fine-grained analysis tool.
Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types.
We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability.
arXiv Detail & Related papers (2023-05-24T10:26:13Z)
- Extrinsic Evaluation of Machine Translation Metrics
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
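Segment-level agreement of this kind is typically quantified with a rank correlation between metric scores and downstream outcomes. A hypothetical sketch with toy data:

```python
# Hypothetical sketch: rank correlation between per-segment metric scores
# and binary downstream task outcomes. The numbers are toy data.
from scipy.stats import kendalltau

metric_scores = [0.91, 0.42, 0.77, 0.30, 0.65, 0.88]  # per-segment metric
task_success  = [1,    0,    1,    0,    0,    1]     # downstream outcome
tau, p_value = kendalltau(metric_scores, task_success)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```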
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- Rethinking Round-Trip Translation for Machine Translation Evaluation
We report the surprising finding that round-trip translation can be used for automatic evaluation without references.
We demonstrate that this rectification is overdue, as round-trip translation could benefit multiple machine translation evaluation tasks.
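The core idea is simple to state in code: translate the source into the target language and back, then compare the round-trip output to the original source instead of to a reference. A hypothetical sketch, where the two translate callables stand in for any MT system:

```python
# Hypothetical sketch of reference-free round-trip evaluation: compare the
# back-translated text to the original source rather than to a reference.
import sacrebleu

def round_trip_score(source, forward_translate, backward_translate):
    """forward_translate / backward_translate: placeholder MT callables."""
    target = forward_translate(source)   # e.g. en -> de
    back = backward_translate(target)    # e.g. de -> en
    return sacrebleu.sentence_bleu(back, [source]).score
```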
arXiv Detail & Related papers (2022-09-15T15:06:20Z)
- OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
- BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text
Machine translation (MT) of online content is commonly used to process posts written in several languages.
In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors.
We conclude that there is a need to fine-tune automatic metrics to make them more robust in detecting sentiment-critical errors.
arXiv Detail & Related papers (2021-09-29T07:51:17Z)
- Improving Multilingual Translation by Representation and Gradient Regularization
We propose a joint approach to regularizing NMT models at both the representation and gradient levels.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Beyond English-Centric Multilingual Machine Translation
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- Don't Use English Dev: On the Zero-Shot Cross-Lingual Evaluation of Contextual Embeddings
We show that the standard practice of using English dev accuracy for model selection in the zero-shot setting makes it difficult to obtain reproducible results.
We recommend providing oracle scores alongside zero-shot results: still fine-tune using English data, but select the checkpoint using the target language's dev set.
arXiv Detail & Related papers (2020-04-30T17:47:17Z)