OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource
Language Pair for Low-Resource Sentence Retrieval
- URL: http://arxiv.org/abs/2205.08605v1
- Date: Tue, 17 May 2022 19:52:42 GMT
- Title: OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource
Language Pair for Low-Resource Sentence Retrieval
- Authors: Tong Niu, Kazuma Hashimoto, Yingbo Zhou, Caiming Xiong
- Abstract summary: We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
- Score: 91.76575626229824
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning parallel sentences in multilingual corpora is essential to curating
data for downstream applications such as Machine Translation. In this work, we
present OneAligner, an alignment model specially designed for sentence
retrieval tasks. This model is able to train on only one language pair and
transfers, in a cross-lingual fashion, to low-resource language pairs with
negligible degradation in performance. When trained with all language pairs of
a large-scale parallel multilingual corpus (OPUS-100), this model achieves the
state-of-the-art result on the Tatoeba dataset, outperforming an equally-sized
previous model by 8.0 points in accuracy while using less than 0.6% of their
parallel data. When finetuned on a single rich-resource language pair, be it
English-centered or not, our model is able to match the performance of the ones
finetuned on all language pairs under the same data budget with less than 2.0
points decrease in accuracy. Furthermore, with the same setup, scaling up the
number of rich-resource language pairs monotonically improves the performance,
reaching a minimum of 0.4 points discrepancy in accuracy, making it less
mandatory to collect any low-resource parallel data. Finally, we conclude
through empirical results and analyses that the performance of the sentence
alignment task depends mostly on the monolingual and parallel data size, up to
a certain size threshold, rather than on what language pairs are used for
training or evaluation.
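
As a concrete illustration of the retrieval task itself, here is a minimal sketch of margin-scored nearest-neighbor search over sentence embeddings, the standard setup for Tatoeba-style evaluation. The embed stand-in, the embedding dimension, and the ratio-margin formulation are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def embed(sentences):
    """Stand-in for a multilingual sentence encoder (e.g., a finetuned
    alignment model); returns random unit vectors purely for illustration."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(sentences), 768))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def margin_retrieve(src_sents, tgt_sents, k=4):
    """Rank target sentences for each source sentence with ratio-margin
    scoring: cosine similarity divided by the mean similarity of each
    side's k nearest neighbors, which corrects for embedding 'hubness'."""
    x, y = embed(src_sents), embed(tgt_sents)
    sim = x @ y.T                                       # cosine sims (unit vectors)
    knn_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # source-side neighborhood
    knn_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # target-side neighborhood
    margin = sim / (0.5 * (knn_x[:, None] + knn_y[None, :]))
    return margin.argmax(axis=1)                        # best target per source

src = ["How are you?", "The weather is nice today."]
tgt = ["Il fait beau aujourd'hui.", "Comment allez-vous ?"]
print(margin_retrieve(src, tgt, k=2))  # ideally [1, 0] with a real encoder
```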
Related papers
- Efficient Adapter Finetuning for Tail Languages in Streaming
Multilingual ASR [44.949146169903074]
The heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation.
Our proposed method brings 12.2% word error rate reduction on average and up to 37.5% on a single locale.
arXiv Detail & Related papers (2024-01-17T06:01:16Z)
- Language Agnostic Multilingual Information Retrieval with Contrastive
Learning [59.26316111760971]
We present an effective method to train multilingual information retrieval systems.
We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models.
Our model can work well even with a small number of parallel sentences.
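
The summary above does not specify the loss, but a common instantiation of contrastive learning on parallel corpora is InfoNCE with in-batch negatives; the sketch below assumes that setup, with an illustrative temperature and dummy embeddings standing in for a pretrained multilingual encoder:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    """InfoNCE over a batch of parallel sentence pairs: row i of src_emb
    and row i of tgt_emb are a translation pair (positive); every other
    row in the batch serves as a negative."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0))      # positives lie on the diagonal
    # symmetric loss: retrieve target given source, and source given target
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

# usage with dummy embeddings in place of a multilingual encoder's output
src_emb, tgt_emb = torch.randn(8, 768), torch.randn(8, 768)
print(in_batch_contrastive_loss(src_emb, tgt_emb).item())
```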
arXiv Detail & Related papers (2022-10-12T23:53:50Z)
- Language-Family Adapters for Low-Resource Multilingual Neural Machine
Translation [129.99918589405675]
Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks.
Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive.
We propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer.
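
For intuition, here is a minimal sketch of a bottleneck adapter routed by language family; the dimensions, the family map, and the routing logic are illustrative assumptions rather than the paper's exact mBART-50 configuration:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual connection --
    the standard adapter block; only these weights are finetuned while
    the large pretrained model stays frozen."""
    def __init__(self, d_model=1024, d_bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, hidden):
        return hidden + self.up(torch.relu(self.down(hidden)))

# one adapter per language family, routed by the input's family id
FAMILY_OF = {"es": "romance", "it": "romance", "hi": "indo-aryan"}  # illustrative
adapters = nn.ModuleDict({fam: BottleneckAdapter()
                          for fam in {"romance", "indo-aryan"}})

hidden = torch.randn(2, 16, 1024)            # (batch, seq_len, d_model)
out = adapters[FAMILY_OF["es"]](hidden)      # apply the Romance-family adapter
print(out.shape)
```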
arXiv Detail & Related papers (2022-09-30T05:02:42Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
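
A hedged sketch of the two-stage recipe follows; the intermediate objective (here reduced to a generic loss on stand-in subtitle batches) illustrates only the staging, not the paper's actual tasks or model:

```python
import torch

def finetune(model, batches, lr=2e-5, steps=2):
    """Generic finetuning loop reused for both stages."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in batches:
            loss = torch.nn.functional.mse_loss(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return model

# stage 1: intermediate cross-lingual task built from parallel subtitles
# (e.g., scoring whether two sentences are translations) -- illustrative
# stage 2: target dialogue state tracking task in the low-resource language
model = torch.nn.Linear(16, 1)    # stand-in for a pretrained multilingual LM
subtitle_batches = [(torch.randn(4, 16), torch.randn(4, 1))]
dst_batches = [(torch.randn(4, 16), torch.randn(4, 1))]
model = finetune(model, subtitle_batches)   # intermediate fine-tuning
model = finetune(model, dst_batches)        # task fine-tuning
```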
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for Multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
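
One standard way to realize such a min-max objective, sketched below under illustrative assumptions (toy model, exponentiated-gradient weight updates), is an iterated best response: the language distribution q shifts toward high-loss languages, then the model descends the q-weighted loss:

```python
import torch

def dro_train(model, lang_batches, steps=3, lr=1e-3, eta=1.0):
    """Min-max training: maximize over a distribution q on languages
    (best response: upweight high-loss languages), minimize the
    q-weighted loss over model parameters."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    q = torch.full((len(lang_batches),), 1.0 / len(lang_batches))
    for _ in range(steps):
        losses = torch.stack([
            torch.nn.functional.mse_loss(model(x), y) for x, y in lang_batches
        ])
        # best response for q: exponentiated-gradient ascent on the losses
        q = q * torch.exp(eta * losses.detach())
        q = q / q.sum()
        loss = (q * losses).sum()              # robust (weighted) objective
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# usage with toy per-language batches standing in for translation data
model = torch.nn.Linear(8, 8)
lang_batches = [(torch.randn(4, 8), torch.randn(4, 8)) for _ in range(3)]
dro_train(model, lang_batches)
```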
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing
Benchmark [31.91964553419665]
We present a new multilingual dataset, called MTOP, comprising 100k annotated utterances in 6 languages across 11 domains.
We achieve an average improvement of +6.3 points on Slot F1 for the two existing multilingual datasets, over best results reported in their experiments.
We demonstrate strong zero-shot performance using pre-trained models combined with automatic translation and alignment, and a proposed distant supervision method to reduce the noise in slot label projection.
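
Slot label projection, the step the distant supervision targets, can be sketched as copying each source slot span onto the target tokens it aligns to; the alignment format and helper below are illustrative assumptions, not the paper's method:

```python
def project_slots(src_tokens, tgt_tokens, alignment, slots):
    """Project slot spans from source to target through word alignments.
    alignment: set of (src_idx, tgt_idx) pairs; slots: {slot: (start, end)}
    spans over source token indices (end exclusive)."""
    projected = {}
    for slot, (start, end) in slots.items():
        tgt_idx = sorted(j for i, j in alignment if start <= i < end)
        if tgt_idx:  # span = min..max aligned target token (noisy for gaps)
            projected[slot] = (tgt_idx[0], tgt_idx[-1] + 1)
    return projected

src = ["wake", "me", "at", "eight", "am"]
tgt = ["reveille-moi", "a", "huit", "heures"]
alignment = {(0, 0), (1, 0), (2, 1), (3, 2), (4, 3)}
print(project_slots(src, tgt, alignment, {"time": (3, 5)}))  # {'time': (2, 4)}
```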
arXiv Detail & Related papers (2020-08-21T07:02:11Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)