xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource
Languages
- URL: http://arxiv.org/abs/2306.12907v1
- Date: Thu, 22 Jun 2023 14:20:15 GMT
- Title: xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource
Languages
- Authors: Mingda Chen, Kevin Heffernan, Onur Çelebi, Alex Mourachko, Holger
Schwenk
- Abstract summary: We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++.
In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts.
- Score: 15.351726952216369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new proxy score for evaluating bitext mining based on
similarity in a multilingual embedding space: xSIM++. In comparison to xSIM,
this improved proxy leverages rule-based approaches to extend English sentences
in any evaluation set with synthetic, hard-to-distinguish examples which more
closely mirror the scenarios we encounter during large-scale mining. We
validate this proxy by running a significant number of bitext mining
experiments for a set of low-resource languages, and subsequently train NMT
systems on the mined data. In comparison to xSIM, we show that xSIM++ is better
correlated with the downstream BLEU scores of translation systems trained on
mined bitexts, providing a reliable proxy of bitext mining performance without
needing to run expensive bitext mining pipelines. xSIM++ also reports
performance for different error types, offering more fine-grained feedback for
model development.
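As a rough illustration of the idea in the abstract, the sketch below computes a retrieval error rate over sentence embeddings, first against the gold English sentences only and then against the same set extended with a synthetic hard negative. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `xsim_error_rate`, the toy 2-dimensional embeddings, and the use of plain cosine similarity are all hypothetical choices made here, and the actual scores may rely on a different (for example, margin-based) similarity.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def xsim_error_rate(src: np.ndarray, tgt: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target is not the aligned one.

    Assumption: src[i] is aligned with tgt[i]. Any rows of tgt beyond
    len(src) are treated as extra candidates (e.g. synthetic hard negatives)
    that can only cause errors.
    """
    sims = l2_normalize(src) @ l2_normalize(tgt).T  # shape: (n_src, n_tgt)
    predicted = sims.argmax(axis=1)                 # nearest target per source
    gold = np.arange(len(src))                      # aligned index per source
    return float((predicted != gold).mean())


# Toy embeddings (hypothetical values, chosen only to make the effect visible).
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.9, 0.1], [0.1, 0.9]])

# A synthetic English sentence that lies closer to src[1] than its true pair.
hard_negatives = np.array([[0.0, 1.0]])

plain_error = xsim_error_rate(src, tgt)
augmented_error = xsim_error_rate(src, np.vstack([tgt, hard_negatives]))
print(plain_error, augmented_error)
```

In this toy setting the error rate is zero against the gold targets alone, while the added hard negative attracts one of the two source sentences. This mirrors the motivation stated in the abstract: an evaluation set without hard-to-distinguish candidates can look easier than large-scale mining actually is.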
Related papers
- RETSim: Resilient and Efficient Text Similarity [1.6228944467258688]
RETSim is a lightweight, multilingual deep learning model trained to produce robust metric embeddings for text retrieval, clustering, and dataset deduplication tasks.
We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings.
We also introduce the W4NT3D benchmark for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings.
arXiv Detail & Related papers (2023-11-28T22:54:33Z)
- Leveraging Language Identification to Enhance Code-Mixed Text Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- Active Learning for Massively Parallel Translation of Constrained Text into Low Resource Languages [26.822210580244885]
We translate a closed text that is known in advance and available in many languages into a new and severely low resource language.
We compare the portion-based approach, which optimizes coherence of the text locally, with the random sampling approach, which increases coverage of the text globally.
We propose an algorithm for human and machine to work together seamlessly to translate a closed text into a severely low resource language.
arXiv Detail & Related papers (2021-08-16T14:49:50Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies [65.92826041406802]
We propose a Proxy-based deep Graph Metric Learning approach from the perspective of graph classification.
Multiple global proxies are leveraged to collectively approximate the original data points for each class.
We design a novel reverse label propagation algorithm, by which the neighbor relationships are adjusted according to ground-truth labels.
arXiv Detail & Related papers (2020-10-26T14:52:42Z)
- Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings [51.47607125262885]
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
arXiv Detail & Related papers (2020-10-15T14:04:03Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Nearest Neighbor Machine Translation [113.96357168879548]
We introduce $k$-nearest-neighbor machine translation ($k$NN-MT).
It predicts tokens with a nearest neighbor classifier over a large datastore of cached examples.
It consistently improves performance across many settings.
arXiv Detail & Related papers (2020-10-01T22:24:46Z)