AVIATE: Exploiting Translation Variants of Artifacts to Improve IR-based Traceability Recovery in Bilingual Software Projects
- URL: http://arxiv.org/abs/2409.19304v1
- Date: Sat, 28 Sep 2024 10:21:37 GMT
- Title: AVIATE: Exploiting Translation Variants of Artifacts to Improve IR-based Traceability Recovery in Bilingual Software Projects
- Authors: Kexin Sun, Yiding Ren, Hongyu Kuang, Hui Gao, Xiaoxing Ma, Guoping Rong, Dong Shao, He Zhang,
- Abstract summary: Traceability plays a vital role in facilitating various software development activities.
The IR (Information Retrieval)-based approaches leverage textual similarity to measure the likelihood of traces between artifacts.
The globalization of software development has introduced new challenges, such as the possible multilingualism on the same concept.
- Score: 14.643142867163748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traceability plays a vital role in facilitating various software development activities by establishing the traces between different types of artifacts (e.g., issues and commits in software repositories). Among the explorations for automated traceability recovery, the IR (Information Retrieval)-based approaches leverage textual similarity to measure the likelihood of traces between artifacts and show advantages in many scenarios. However, the globalization of software development has introduced new challenges, such as the possible multilingualism on the same concept (e.g., "ShuXing" vs. "attribute") in the artifact texts, thus significantly hampering the performance of IR-based approaches. Existing research has shown that machine translation can help address the term inconsistency in bilingual projects. However, the translation can also bring in synonymous terms that are not consistent with those in the bilingual projects (e.g., another translation of "ShuXing" as "property"). Therefore, we propose an enhancement strategy called AVIATE that exploits translation variants from different translators by utilizing the word pairs that appear simultaneously across the translation variants from different kinds artifacts (a.k.a. consensual biterms). We use these biterms to first enrich the artifact texts, and then to enhance the calculated IR values for improving IR-based traceability recovery for bilingual software projects. The experiments on 17 bilingual projects (involving English and 4 other languages) demonstrate that AVIATE significantly outperformed the IR-based approach with machine translation (the state-of-the-art in this field) with an average increase of 16.67 in Average Precision (31.43%) and 8.38 (11.22%) in Mean Average Precision, indicating its effectiveness in addressing the challenges of multilingual traceability recovery.
Related papers
- LLM-based Translation Inference with Iterative Bilingual Understanding [45.00660558229326]
We propose a novel Iterative Bilingual Understanding Translation method based on the cross-lingual capabilities of large language models (LLMs)
The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately.
The proposed IBUT outperforms several strong comparison methods.
arXiv Detail & Related papers (2024-10-16T13:21:46Z) - Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets [3.54128607634285]
We study the impact of the visual modality on translation efficacy by leveraging real-world translation datasets.
We find that the visual modality proves advantageous for the majority of authentic translation datasets.
Our results suggest that visual information serves a supplementary role in multimodal translation and can be substituted.
arXiv Detail & Related papers (2024-04-09T08:19:10Z) - Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation [64.5862977630713]
This study investigates how Large Language Models (LLMs) leverage source and reference data in machine translation evaluation task.
We find that reference information significantly enhances the evaluation accuracy, while surprisingly, source information sometimes is counterproductive.
arXiv Detail & Related papers (2024-01-12T13:23:21Z) - TRIAD: Automated Traceability Recovery based on Biterm-enhanced
Deduction of Transitive Links among Artifacts [53.92293118080274]
Traceability allows stakeholders to extract and comprehend the trace links among software artifacts introduced across the software life cycle.
Most rely on textual similarities among software artifacts, such as those based on Information Retrieval (IR)
arXiv Detail & Related papers (2023-12-28T06:44:24Z) - On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z) - Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual
Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved.
However, the peak performance is not met using the general-purpose multilingual text encoders off-the-shelf', but rather relying on their variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z) - Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact in existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.