Majority Voting with Bidirectional Pre-translation For Bitext Retrieval
- URL: http://arxiv.org/abs/2103.06369v2
- Date: Fri, 12 Mar 2021 14:59:49 GMT
- Authors: Alex Jones and Derry Tanti Wijaya
- Abstract summary: A popular approach has been to mine so-called "pseudo-parallel" sentences from paired documents in two languages.
In this paper, we outline some problems with current methods, propose computationally economical solutions to those problems, and demonstrate success with novel methods.
We make the code and data used for our experiments publicly available.
- Score: 2.580271290008534
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Obtaining high-quality parallel corpora is of paramount importance for
training NMT systems. However, as many language pairs lack adequate
gold-standard training data, a popular approach has been to mine so-called
"pseudo-parallel" sentences from paired documents in two languages. In this
paper, we outline some problems with current methods, propose computationally
economical solutions to those problems, and demonstrate success with novel
methods on the Tatoeba similarity search benchmark and on a downstream task,
namely NMT. We uncover the effect of resource-related factors (i.e. how much
monolingual/bilingual data is available for a given language) on the optimal
choice of bitext mining approach, and echo problems with the oft-used BUCC
dataset that have been observed by others. We make the code and data used for
our experiments publicly available.
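The abstract's core idea of voting across retrieval runs can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' released code: it assumes each retrieval setting (e.g. original text, or text pre-translated in one direction or the other) produces a mapping from source indices to nearest-neighbor target indices, and keeps only pairs that a majority of runs agree on.

```python
# Hypothetical sketch of majority voting over bitext retrieval runs.
# Each run maps a source sentence index to its retrieved target index
# under one setting (original text, or bidirectionally pre-translated text).
from collections import Counter

def majority_vote(runs, threshold=2):
    """Keep (src, tgt) pairs proposed by at least `threshold` runs.

    runs: list of dicts {src_idx: tgt_idx}, one per retrieval setting.
    """
    votes = Counter()
    for run in runs:
        for src, tgt in run.items():
            votes[(src, tgt)] += 1
    return {src: tgt for (src, tgt), n in votes.items() if n >= threshold}

# Three runs: two agree that source 0 aligns with target 5,
# and two agree that source 1 aligns with target 7.
runs = [{0: 5, 1: 7}, {0: 5, 1: 8}, {0: 6, 1: 7}]
print(majority_vote(runs))  # {0: 5, 1: 7}
```

The filtering step discards pairs retrieved in only one setting, which is one computationally cheap way to reduce noise in mined pseudo-parallel data.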
Related papers
- An approach for mistranslation removal from popular dataset for Indic MT Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2024-01-12T06:37:19Z)
- Translation-Enhanced Multilingual Text-to-Image Generation [61.41730893884428]
Research on text-to-image generation (TTI) still predominantly focuses on the English language.
In this work, we thus investigate multilingual TTI and the current potential of neural machine translation (NMT) to bootstrap mTTI systems.
We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework.
arXiv Detail & Related papers (2023-05-30T17:03:52Z)
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data with translated source sentences, yet receives natural source sentences at inference.
This source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses the pseudo-parallel data {natural source, translated target} to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints [7.452359972117693]
Back-translation of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT).
We introduce HintedBT -- a family of techniques which provides hints (through tags) to the encoder and decoder.
We show that using these hints, both separately and together, significantly improves translation quality.
arXiv Detail & Related papers (2021-09-09T17:43:20Z)
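The tag-based hints described in the HintedBT entry above can be sketched in miniature. This is a hypothetical illustration of the general idea (the exact tag scheme and bucket boundaries are assumptions, not taken from the paper): a quality-bucket tag is prepended to each back-translated source sentence so the model can learn to weigh noisier pairs differently.

```python
# Hypothetical sketch of quality hints for back-translated data:
# prepend a bucket tag (e.g. <bt_q0>, <bt_q1>, <bt_q2>) to the source
# based on an estimated quality score for the sentence pair.
def add_quality_hint(source, quality_score, buckets=(0.3, 0.7)):
    """Prepend a quality-bucket tag; bucket = number of thresholds met."""
    bucket = sum(quality_score >= b for b in buckets)
    return f"<bt_q{bucket}> {source}"

print(add_quality_hint("guten morgen", 0.85))  # <bt_q2> guten morgen
print(add_quality_hint("hallo welt", 0.10))    # <bt_q0> hallo welt
```

At training time such tags let a single model condition on data quality; at inference the highest-quality tag would typically be supplied.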
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.