Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval
- URL: http://arxiv.org/abs/2101.08370v1
- Date: Thu, 21 Jan 2021 00:15:38 GMT
- Title: Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval
- Authors: Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, and Goran Glavaš
- Abstract summary: We present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved.
However, peak performance is not reached with the general-purpose multilingual text encoders used 'off-the-shelf', but rather with their variants that have been further specialized for sentence understanding tasks.
- Score: 51.60862829942932
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Pretrained multilingual text encoders based on neural Transformer
architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong
performance on a myriad of language understanding tasks. Consequently, they
have been adopted as a go-to paradigm for multilingual and cross-lingual
representation learning and transfer, rendering cross-lingual word embeddings
(CLWEs) effectively obsolete. However, questions remain as to what extent this
finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual
IR (CLIR) tasks. Therefore, in this work we present a systematic empirical
study focused on the suitability of the state-of-the-art multilingual encoders
for cross-lingual document and sentence retrieval tasks across a large number
of language pairs. In contrast to supervised language understanding, our
results indicate that for unsupervised document-level CLIR -- a setup with no
relevance judgments for IR-specific fine-tuning -- pretrained encoders fail to
significantly outperform models based on CLWEs. For sentence-level CLIR, we
demonstrate that state-of-the-art performance can be achieved. However, peak
performance is not reached by using the general-purpose multilingual text
encoders 'off-the-shelf', but rather by relying on their variants that have
been further specialized for sentence understanding tasks.
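
As a concrete illustration of the sentence-level setup described above, the sketch below ranks target-language candidate sentences for a source-language query by cosine similarity of their embeddings. It is a minimal sketch, not the paper's evaluation code: the sentence-transformers package and the paraphrase-multilingual-MiniLM-L12-v2 checkpoint are illustrative stand-ins for the sentence-specialized multilingual encoders the abstract refers to.

```python
# Minimal unsupervised sentence-level CLIR sketch: embed a query and
# target-language candidates with a sentence-specialized multilingual encoder
# and rank candidates by cosine similarity (no relevance judgments are used).
# Assumes the sentence-transformers package; the model name is an illustrative
# choice, not one of the specific encoders evaluated in the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "Wie wirkt sich der Klimawandel auf die Landwirtschaft aus?"  # German query
candidates = [
    "Climate change is reducing crop yields in many regions.",
    "The stock market closed higher on Friday.",
    "Irrigation practices are adapting to hotter, drier summers.",
]

# Encode the query and candidates into a shared multilingual vector space.
query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarities serve directly as retrieval scores.
scores = util.cos_sim(query_emb, cand_embs)[0]
ranking = scores.argsort(descending=True)

for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f"{rank}. ({scores[idx].item():.3f}) {candidates[idx]}")
```

For the document-level setting, the abstract's point is that this kind of unsupervised cosine ranking with pretrained encoders does not significantly outperform baselines built from cross-lingual word embeddings (e.g., aggregated CLWE document vectors).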
Related papers
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual sentence encoders for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., with learning to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- Syntax-augmented Multilingual BERT for Cross-lingual Transfer [37.99210035238424]
This work shows that explicitly providing language syntax during mBERT training helps cross-lingual transfer.
Experimental results show that syntax-augmented mBERT improves cross-lingual transfer on popular benchmarks.
arXiv Detail & Related papers (2021-06-03T21:12:50Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids the degenerate case of predicting masked words conditioned only on the context in the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- On Learning Universal Representations Across Languages [37.555675157198145]
We extend existing approaches to learn sentence-level representations and show their effectiveness on cross-lingual understanding and generation.
Specifically, we propose a Hierarchical Contrastive Learning (HiCTL) method to learn universal representations for parallel sentences distributed in one or multiple languages (a generic sketch of this kind of contrastive objective follows after this list).
We conduct evaluations on two challenging cross-lingual tasks, XTREME and machine translation.
arXiv Detail & Related papers (2020-07-31T10:58:39Z)
- Enhancing Answer Boundary Detection for Multilingual Machine Reading Comprehension [86.1617182312817]
We propose two auxiliary tasks in the fine-tuning stage to create additional phrase boundary supervision.
The first is a mixed machine reading comprehension task, which translates the question or passage into other languages and builds cross-lingual question-passage pairs.
The second is a language-agnostic knowledge masking task that leverages knowledge phrases mined from the web.
arXiv Detail & Related papers (2020-04-29T10:44:00Z)
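
The HiCTL entry above mentions contrastive learning over parallel sentences; as a rough illustration of that general idea (and not the hierarchical objective from that paper), the following sketch computes a standard in-batch InfoNCE-style loss over paired source/target sentence embeddings. Function names, dimensions, and the temperature value are illustrative assumptions.

```python
# Generic in-batch contrastive loss over parallel sentence embeddings.
# This is NOT the hierarchical HiCTL objective from the paper listed above;
# it is a standard InfoNCE-style sketch of the underlying idea: embeddings of
# aligned (source, target) sentence pairs are pulled together, while the other
# sentences in the batch serve as negatives.
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(src_emb: torch.Tensor,
                              tgt_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """src_emb and tgt_emb are (batch, dim); row i of each forms a parallel pair."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    # Pairwise cosine similarities between all source and target sentences.
    logits = src @ tgt.t() / temperature          # shape: (batch, batch)
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric loss: retrieve the right target for each source and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


# Toy usage with random vectors standing in for encoder outputs.
if __name__ == "__main__":
    src = torch.randn(8, 384)
    tgt = torch.randn(8, 384)
    print(in_batch_contrastive_loss(src, tgt))
```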
This list is automatically generated from the titles and abstracts of the papers on this site.