Massively Multilingual Document Alignment with Cross-lingual
Sentence-Mover's Distance
- URL: http://arxiv.org/abs/2002.00761v2
- Date: Sun, 11 Oct 2020 05:26:32 GMT
- Title: Massively Multilingual Document Alignment with Cross-lingual
Sentence-Mover's Distance
- Authors: Ahmed El-Kishky, Francisco Guzmán
- Abstract summary: Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other.
We develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages.
These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low, mid, and high-resource language pairs.
- Score: 8.395430195053061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document alignment aims to identify pairs of documents in two distinct
languages that are of comparable content or translations of each other. Such
aligned data can be used for a variety of NLP tasks from training cross-lingual
representations to mining parallel data for machine translation. In this paper
we develop an unsupervised scoring function that leverages cross-lingual
sentence embeddings to compute the semantic distance between documents in
different languages. These semantic distances are then used to guide a document
alignment algorithm to properly pair cross-lingual web documents across a
variety of low, mid, and high-resource language pairs. Recognizing that our
proposed scoring function and other state-of-the-art methods are
computationally intractable for long web documents, we utilize a more tractable
greedy algorithm that performs comparably. We experimentally demonstrate that
our distance metric produces better alignments than current baselines,
outperforming them by 7% on high-resource language pairs, 15% on mid-resource
language pairs, and 22% on low-resource language pairs.
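The scoring function and greedy matching described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes sentences have already been encoded into L2-normalized cross-lingual embeddings (for example with a model such as LASER), and it uses a simple greedy pairing in place of exact optimal transport, mirroring the paper's tractability argument.

```python
import numpy as np

def greedy_smd(src_emb, tgt_emb):
    """Greedy approximation to a sentence-mover's distance.

    src_emb, tgt_emb: (n, d) and (m, d) arrays of L2-normalized
    cross-lingual sentence embeddings for the two documents.
    Returns the average cosine distance over greedily matched pairs.
    """
    # pairwise cosine distances between every source/target sentence pair
    dist = 1.0 - src_emb @ tgt_emb.T
    pairs = min(len(src_emb), len(tgt_emb))
    used_s, used_t = set(), set()
    total = 0.0
    for _ in range(pairs):
        # greedily pick the cheapest remaining pair instead of solving
        # the full (intractable) transport problem
        best = None
        for i in range(len(src_emb)):
            if i in used_s:
                continue
            for j in range(len(tgt_emb)):
                if j in used_t:
                    continue
                if best is None or dist[i, j] < dist[best]:
                    best = (i, j)
        used_s.add(best[0])
        used_t.add(best[1])
        total += dist[best]
    return total / pairs
```

A lower score indicates a better candidate pairing, so document alignment can rank all candidate cross-lingual document pairs by this distance and pair them greedily as well.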
Related papers
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings? [18.968571816913208]
We provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models.
We show that a clever combination of sentence embeddings is usually better than encoding the full document as a single unit.
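The combination of sentence embeddings into a document-level representation that this entry studies can be illustrated with a small sketch. The weighting scheme here (uniform by default, with optional caller-supplied weights) is an assumption for illustration, not the specific combination the paper found best.

```python
import numpy as np

def doc_embedding(sent_embs, weights=None):
    """Combine per-sentence embeddings into one document vector.

    sent_embs: (n, d) array of sentence embeddings (e.g., from LASER
    or LaBSE). weights: optional per-sentence weights, such as sentence
    length or IDF-based importance (an illustrative assumption here).
    """
    sent_embs = np.asarray(sent_embs, dtype=float)
    if weights is None:
        weights = np.ones(len(sent_embs))
    # weighted average, renormalized to unit length so documents
    # can be compared with cosine similarity
    v = np.average(sent_embs, axis=0, weights=weights)
    return v / np.linalg.norm(v)
```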
arXiv Detail & Related papers (2023-04-28T12:11:21Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Multilingual Representation Distillation with Contrastive Learning [20.715534360712425]
We integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences.
We validate our approach with multilingual similarity search and corpus filtering tasks.
arXiv Detail & Related papers (2022-10-10T22:27:04Z)
- Cross-lingual Text Classification with Heterogeneous Graph Neural Network [2.6936806968297913]
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages.
Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks.
We propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification.
arXiv Detail & Related papers (2021-05-24T12:45:42Z)
- CDA: a Cost Efficient Content-based Multilingual Web Document Aligner [97.98885151955467]
We introduce a Content-based Document Alignment approach to align multilingual web documents based on content.
We leverage lexical translation models to build vector representations using TF-IDF.
Experiments show that CDA is robust, cost-effective, and is significantly superior in (i) processing large and noisy web data and (ii) scaling to new and low-resourced languages.
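The TF-IDF-with-lexical-translation idea this entry describes can be sketched in a toy form: target-language tokens are first mapped into the source vocabulary with a bilingual dictionary (the dictionary below is hypothetical, for illustration only), and documents are then compared as TF-IDF vectors by cosine similarity.

```python
import math
from collections import Counter

# hypothetical bilingual dictionary mapping French tokens to English
FR_EN = {"chat": "cat", "chien": "dog", "couru": "ran", "assis": "sat"}

def translate(tokens, lexicon):
    # map target-language tokens into the source vocabulary,
    # dropping tokens without a dictionary entry
    return [lexicon[t] for t in tokens if t in lexicon]

def tfidf(docs):
    # docs: list of token lists, all in the shared (source) vocabulary
    df = Counter()
    for d in docs:
        df.update(set(d))
    n = len(docs)
    return [{t: (c / len(d)) * math.log(n / df[t])
             for t, c in Counter(d).items()} for d in docs]

def cosine(u, v):
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = (math.sqrt(sum(w * w for w in u.values()))
           * math.sqrt(sum(w * w for w in v.values())))
    return num / den if den else 0.0
```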
arXiv Detail & Related papers (2021-02-20T03:37:23Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Bilingual Text Extraction as Reading Comprehension [23.475200800530306]
We propose a method to extract bilingual texts automatically from noisy parallel corpora by framing the problem as a token-level span prediction.
To extract a span of the target document that is a translation of a given source sentence (span), we use either QANet or multilingual BERT.
arXiv Detail & Related papers (2020-04-29T23:41:32Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.