Related papers: xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

URL: http://arxiv.org/abs/2306.12907v1
Date: Thu, 22 Jun 2023 14:20:15 GMT
Title: xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
Authors: Mingda Chen, Kevin Heffernan, Onur \c{C}elebi, Alex Mourachko, Holger Schwenk
Abstract summary: We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts.
Score: 15.351726952216369
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.

Related papers

RETSim: Resilient and Efficient Text Similarity [1.6228944467258688]
RETSim is a lightweight, multilingual deep learning model trained to produce robust metric embeddings for text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings. We also introduce the W4NT3D benchmark for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings.
arXiv Detail & Related papers (2023-11-28T22:54:33Z)
Leveraging Language Identification to Enhance Code-Mixed Text Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text. Our study aims to improve BERT-based models performance on low-resource Code-Mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing. Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
Active Learning for Massively Parallel Translation of Constrained Text into Low Resource Languages [26.822210580244885]
We translate a closed text that is known in advance and available in many languages into a new and severely low resource language. We compare the portion-based approach that optimize coherence of the text locally with the random sampling approach that increases coverage of the text globally. We propose an algorithm for human and machine to work together seamlessly to translate a closed text into a severely low resource language.
arXiv Detail & Related papers (2021-08-16T14:49:50Z)
Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages. We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies [65.92826041406802]
We propose a Proxy-based deep Graph Metric Learning approach from the perspective of graph classification. Multiple global proxies are leveraged to collectively approximate the original data points for each class. We design a novel reverse label propagation algorithm, by which the neighbor relationships are adjusted according to ground-truth labels.
arXiv Detail & Related papers (2020-10-26T14:52:42Z)
Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings [51.47607125262885]
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
arXiv Detail & Related papers (2020-10-15T14:04:03Z)
Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model. We carry out experiments on 42 translation directions across a diverse setting, including low, medium, rich resource, and as well as transferring to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
Nearest Neighbor Machine Translation [113.96357168879548]
We introduce $k$-nearest-neighbor machine translation ($k$NN-MT) It predicts tokens with a nearest neighbor classifier over a large datastore of cached examples. It consistently improves performance across many settings.
arXiv Detail & Related papers (2020-10-01T22:24:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.