Majority Voting with Bidirectional Pre-translation For Bitext Retrieval
- URL: http://arxiv.org/abs/2103.06369v2
- Date: Fri, 12 Mar 2021 14:59:49 GMT
- Authors: Alex Jones and Derry Tanti Wijaya
- Abstract summary: A popular approach has been to mine so-called "pseudo-parallel" sentences from paired documents in two languages.
In this paper, we outline some problems with current methods, propose computationally economical solutions to those problems, and demonstrate success with novel methods.
We make the code and data used for our experiments publicly available.
- Score: 2.580271290008534
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Obtaining high-quality parallel corpora is of paramount importance for
training NMT systems. However, as many language pairs lack adequate
gold-standard training data, a popular approach has been to mine so-called
"pseudo-parallel" sentences from paired documents in two languages. In this
paper, we outline some problems with current methods, propose computationally
economical solutions to those problems, and demonstrate success with novel
methods on the Tatoeba similarity search benchmark and on a downstream task,
namely NMT. We uncover the effect of resource-related factors (i.e. how much
monolingual/bilingual data is available for a given language) on the optimal
choice of bitext mining approach, and echo problems with the oft-used BUCC
dataset that have been observed by others. We make the code and data used for
our experiments publicly available.
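The abstract's core idea of voting across retrieval runs can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' released code: it assumes each retrieval setting (e.g. original text, or text pre-translated in one direction or the other) produces a mapping from source indices to nearest-neighbor target indices, and keeps only pairs that a majority of runs agree on.

```python
# Hypothetical sketch of majority voting over bitext retrieval runs.
# Each run maps a source sentence index to its retrieved target index
# under one setting (original text, or bidirectionally pre-translated text).
from collections import Counter

def majority_vote(runs, threshold=2):
    """Keep (src, tgt) pairs proposed by at least `threshold` runs.

    runs: list of dicts {src_idx: tgt_idx}, one per retrieval setting.
    """
    votes = Counter()
    for run in runs:
        for src, tgt in run.items():
            votes[(src, tgt)] += 1
    return {src: tgt for (src, tgt), n in votes.items() if n >= threshold}

# Three runs: two agree that source 0 aligns with target 5,
# and two agree that source 1 aligns with target 7.
runs = [{0: 5, 1: 7}, {0: 5, 1: 8}, {0: 6, 1: 7}]
print(majority_vote(runs))  # {0: 5, 1: 7}
```

The filtering step discards pairs retrieved in only one setting, which is one computationally cheap way to reduce noise in mined pseudo-parallel data.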
Related papers
- An approach for mistranslation removal from popular dataset for Indic MT Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2024-01-12T06:37:19Z)
- Translation-Enhanced Multilingual Text-to-Image Generation [61.41730893884428]
Research on text-to-image generation (TTI) still predominantly focuses on the English language.
In this work, we thus investigate multilingual TTI and the current potential of neural machine translation (NMT) to bootstrap mTTI systems.
We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework.
arXiv Detail & Related papers (2023-05-30T17:03:52Z)
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data with translated source sentences, yet receives natural source sentences at inference.
This source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses the pseudo-parallel data {natural source, translated target} to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints [7.452359972117693]
Back-translation of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT).
We introduce HintedBT -- a family of techniques which provides hints (through tags) to the encoder and decoder.
We show that using these hints, both separately and together, significantly improves translation quality.
arXiv Detail & Related papers (2021-09-09T17:43:20Z)
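The tag-based hints described in the HintedBT entry above can be sketched in miniature. This is a hypothetical illustration of the general idea (the exact tag scheme and bucket boundaries are assumptions, not taken from the paper): a quality-bucket tag is prepended to each back-translated source sentence so the model can learn to weigh noisier pairs differently.

```python
# Hypothetical sketch of quality hints for back-translated data:
# prepend a bucket tag (e.g. <bt_q0>, <bt_q1>, <bt_q2>) to the source
# based on an estimated quality score for the sentence pair.
def add_quality_hint(source, quality_score, buckets=(0.3, 0.7)):
    """Prepend a quality-bucket tag; bucket = number of thresholds met."""
    bucket = sum(quality_score >= b for b in buckets)
    return f"<bt_q{bucket}> {source}"

print(add_quality_hint("guten morgen", 0.85))  # <bt_q2> guten morgen
print(add_quality_hint("hallo welt", 0.10))    # <bt_q0> hallo welt
```

At training time such tags let a single model condition on data quality; at inference the highest-quality tag would typically be supplied.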
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.