A General-Purpose Multilingual Document Encoder
- URL: http://arxiv.org/abs/2305.07016v1
- Date: Thu, 11 May 2023 17:55:45 GMT
- Title: A General-Purpose Multilingual Document Encoder
- Authors: Onur Galoğlu, Robert Litschko, and Goran Glavaš
- Abstract summary: We pretrain a massively multilingual document encoder as a hierarchical transformer model (HMDE)
We leverage Wikipedia as a readily available source of comparable documents for creating training data.
We evaluate the effectiveness of HMDE on the two arguably most common and prominent cross-lingual document-level tasks.
- Score: 9.868221447090855
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Massively multilingual pretrained transformers (MMTs) have tremendously
pushed the state of the art on multilingual NLP and cross-lingual transfer of
NLP models in particular. While a large body of work leveraged MMTs to mine
parallel data and induce bilingual document embeddings, much less effort has
been devoted to training a general-purpose (massively) multilingual document
encoder that can be used for both supervised and unsupervised document-level
tasks. In this work, we pretrain a massively multilingual document encoder as a
hierarchical transformer model (HMDE) in which a shallow document transformer
contextualizes sentence representations produced by a state-of-the-art
pretrained multilingual sentence encoder. We leverage Wikipedia as a readily
available source of comparable documents for creating training data, and train
HMDE by means of a cross-lingual contrastive objective, further exploiting the
category hierarchy of Wikipedia for creation of difficult negatives. We
evaluate the effectiveness of HMDE in the two arguably most common and prominent
cross-lingual document-level tasks: (1) cross-lingual transfer for topical
document classification and (2) cross-lingual document retrieval. HMDE is
significantly more effective than (i) aggregations of segment-based
representations and (ii) a multilingual Longformer. Crucially, owing to its
massively multilingual lower transformer, HMDE successfully generalizes to
languages unseen in document-level pretraining. We publicly release our code
and models at
https://github.com/ogaloglu/pre-training-multilingual-document-encoders .
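To make the architecture and training signal concrete, here is a minimal sketch of an HMDE-style model: a shallow document transformer contextualizes precomputed sentence embeddings from a pretrained multilingual sentence encoder, and comparable documents in different languages are aligned with an in-batch contrastive loss. The class names, dimensions, pooling, and loss choices below are illustrative assumptions, not the authors' released implementation (see the repository linked above).

```python
# Minimal sketch of a hierarchical multilingual document encoder (HMDE-style).
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalDocumentEncoder(nn.Module):
    """Shallow document transformer over precomputed multilingual sentence embeddings."""

    def __init__(self, sent_dim: int = 768, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=sent_dim, nhead=num_heads, batch_first=True
        )
        self.doc_transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, sent_embs: torch.Tensor, sent_mask: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, num_sents, sent_dim), e.g. vectors from a pretrained
        #            multilingual sentence encoder such as LaBSE
        # sent_mask: (batch, num_sents) bool, True for real sentences, False for padding
        hidden = self.doc_transformer(sent_embs, src_key_padding_mask=~sent_mask)
        # Mean-pool over real sentences to obtain a single document embedding.
        mask = sent_mask.unsqueeze(-1).float()
        doc_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        return F.normalize(doc_emb, dim=-1)


def cross_lingual_contrastive_loss(src: torch.Tensor, tgt: torch.Tensor,
                                   temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE over a batch of comparable document pairs (e.g. linked Wikipedia
    articles); other in-batch documents act as negatives. Hard negatives, such as
    documents from the same Wikipedia category, could simply be appended to `tgt`."""
    logits = src @ tgt.t() / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0), device=src.device)
    return F.cross_entropy(logits, labels)
```

Because the lower (sentence-level) encoder stays a massively multilingual model, the document transformer on top never needs to see every language during pretraining, which is the property the abstract credits for generalization to unseen languages.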
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings? [18.968571816913208]
We provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models.
We show that a clever combination of sentence embeddings is usually better than encoding the full document as a single unit (a minimal pooling sketch follows this entry).
arXiv Detail & Related papers (2023-04-28T12:11:21Z)
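A minimal sketch of the sentence-embedding route this paper examines, assuming LaBSE as the sentence encoder and plain mean pooling (the paper compares several, more clever combinations):

```python
# Hypothetical sketch: embed each sentence with a pretrained multilingual
# sentence encoder and mean-pool into one document vector.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def document_embedding(sentences: list[str]) -> np.ndarray:
    # One vector per sentence, L2-normalized, then averaged over the document.
    sent_embs = encoder.encode(sentences, normalize_embeddings=True)
    return sent_embs.mean(axis=0)
```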
- Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives (see the sketch after this entry).
arXiv Detail & Related papers (2023-02-03T09:54:27Z)
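A rough sketch of one masked sentence prediction step under assumed shapes and names; the `doc_encoder`, mask embedding, and negative sampling are placeholders, not MSM's released implementation:

```python
# One masked-sentence-prediction step: replace a sentence vector with a learned
# [MASK] embedding and train the document encoder's output at that position to
# match the original vector against sampled negatives (contrastive loss).
import torch
import torch.nn.functional as F

def masked_sentence_loss(doc_encoder, sent_embs, mask_emb, pos, negatives, tau=0.05):
    # sent_embs: (num_sents, dim); pos: index of the masked sentence
    # negatives: (num_neg, dim) sentence vectors sampled from other documents
    target = sent_embs[pos]                       # original sentence vector
    corrupted = sent_embs.clone()
    corrupted[pos] = mask_emb                     # substitute the [MASK] embedding
    hidden = doc_encoder(corrupted.unsqueeze(0)).squeeze(0)
    query = F.normalize(hidden[pos], dim=-1)
    candidates = F.normalize(torch.cat([target.unsqueeze(0), negatives]), dim=-1)
    logits = (candidates @ query / tau).unsqueeze(0)   # (1, 1 + num_neg)
    label = torch.zeros(1, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, label)
```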
- DOCmT5: Document-Level Pretraining of Multilingual Language Models [9.072507490639218]
We introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large-scale parallel documents.
We propose a simple and effective pre-training objective - Document Reordering Machine Translation (DrMT).
DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks.
arXiv Detail & Related papers (2021-12-16T08:58:52Z)
- Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents [19.59133362105703]
Document-level neural machine translation (DocNMT) delivers coherent translations by incorporating cross-sentence context.
We study whether and how contextual modeling in DocNMT is transferable from sentences to documents in a zero-shot fashion.
arXiv Detail & Related papers (2021-09-21T17:49:34Z)
- MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer [13.24356999779404]
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents.
The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy.
We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target).
arXiv Detail & Related papers (2021-09-02T12:52:55Z)
- CDA: a Cost Efficient Content-based Multilingual Web Document Aligner [97.98885151955467]
We introduce CDA, a Content-based Document Alignment approach for aligning multilingual web documents.
We leverage lexical translation models to build vector representations using TF-IDF.
Experiments show that CDA is robust, cost-effective, and significantly superior in (i) processing large and noisy web data and (ii) scaling to new and low-resourced languages (a minimal alignment sketch follows this entry).
arXiv Detail & Related papers (2021-02-20T03:37:23Z)
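A hedged sketch of content-based alignment in the spirit of CDA, assuming a word-translation lexicon and greedy nearest-neighbour matching (both stand-ins for the paper's actual components):

```python
# Project target-language documents into the source language with a lexical
# dictionary, vectorize everything with TF-IDF, and pair documents by cosine
# similarity. `lexicon` and the greedy 1-to-1 matching are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def translate_tokens(doc: str, lexicon: dict[str, str]) -> str:
    # Replace each token with its translation whenever the lexicon covers it.
    return " ".join(lexicon.get(tok, tok) for tok in doc.lower().split())

def align_documents(src_docs: list[str], tgt_docs: list[str],
                    lexicon: dict[str, str]) -> list[tuple[int, int]]:
    projected = [translate_tokens(d, lexicon) for d in tgt_docs]
    matrix = TfidfVectorizer().fit_transform(src_docs + projected)
    src_vecs, tgt_vecs = matrix[:len(src_docs)], matrix[len(src_docs):]
    sims = cosine_similarity(src_vecs, tgt_vecs)
    # Greedily pair each source document with its most similar target document.
    return [(i, int(sims[i].argmax())) for i in range(len(src_docs))]
```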
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It effectively avoids degenerating into predicting masked words conditioned only on context from the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark (a cross-attention sketch follows this entry).
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
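A hypothetical sketch of an encoder layer with the extra cross-attention sublayer described above; the placement, sizes, and normalization scheme are assumptions rather than VECO's exact architecture:

```python
# Encoder layer with an added cross-attention sublayer: tokens of one language
# attend to hidden states of the parallel sentence in the other language.
import torch
import torch.nn as nn

class CrossAttentionEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x:     (batch, len_x, d_model) tokens of the current language
        # other: (batch, len_y, d_model) hidden states of the parallel sentence
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.cross_attn(x, other, other, need_weights=False)[0])
        return self.norm3(x + self.ffn(x))
```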
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
For simple tasks the translated text shares the label of the source text, but this shared label becomes inaccurate or unavailable for more complex tasks; to tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language (see the loss sketch after this entry).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
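A minimal sketch of such a KL-divergence self-teaching loss; the temperature and the gradient detaching of the pseudo-labels are assumptions:

```python
# Soft pseudo-labels produced for one view of an example supervise the model's
# predictions on its translation via a KL-divergence term.
import torch
import torch.nn.functional as F

def self_teaching_kl_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    # student_logits: predictions on translated text in the target language
    # teacher_logits: auto-generated soft pseudo-labels (no gradient flows back)
    teacher_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```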
- MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer [136.09386219006123]
We propose MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages.
MAD-X outperforms the state of the art in cross-lingual transfer across a representative set of typologically diverse languages on named entity recognition and causal commonsense reasoning (an adapter sketch follows this entry).
arXiv Detail & Related papers (2020-04-30T18:54:43Z)
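A minimal sketch of the bottleneck adapters and the language/task adapter stacking that MAD-X describes; the reduction factor and the wrapper module are illustrative assumptions (the released code builds on the AdapterHub framework):

```python
# Bottleneck adapters stacked MAD-X-style: a language adapter followed by a
# task adapter, each a small residual bottleneck MLP inside a frozen layer.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int = 768, reduction: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, d_model // reduction)
        self.up = nn.Linear(d_model // reduction, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))   # residual bottleneck

class AdaptedLayerOutput(nn.Module):
    """Wraps a frozen transformer sublayer output with language + task adapters."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.language_adapter = Adapter(d_model)
        self.task_adapter = Adapter(d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # For zero-shot transfer, the language adapter is swapped for the target
        # language at inference while the task adapter stays fixed.
        return self.task_adapter(self.language_adapter(hidden))
```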
This list is automatically generated from the titles and abstracts of the papers in this site.