Are the Best Multilingual Document Embeddings simply Based on Sentence
Embeddings?
- URL: http://arxiv.org/abs/2304.14796v1
- Date: Fri, 28 Apr 2023 12:11:21 GMT
- Title: Are the Best Multilingual Document Embeddings simply Based on Sentence
Embeddings?
- Authors: Sonal Sannigrahi, Josef van Genabith, Cristina Espana-Bonet
- Abstract summary: We provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models.
We show that a clever combination of sentence embeddings is usually better than encoding the full document as a single unit.
- Score: 18.968571816913208
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense vector representations for textual data are crucial in modern NLP. Word
embeddings and sentence embeddings estimated from raw texts are key in
achieving state-of-the-art results in various tasks requiring semantic
understanding. However, obtaining embeddings at the document level is
challenging due to computational requirements and lack of appropriate data.
Instead, most approaches fall back on computing document embeddings based on
sentence representations. Although there exist architectures and models to
encode documents fully, they are in general limited to English and a few other
high-resourced languages. In this work, we provide a systematic comparison of
methods to produce document-level representations from sentences based on
LASER, LaBSE, and Sentence BERT pre-trained multilingual models. We compare
input token number truncation, sentence averaging as well as some simple
windowing and in some cases new augmented and learnable approaches, on 3 multi-
and cross-lingual tasks in 8 languages belonging to 3 different language
families. Our task-based extrinsic evaluations show that, independently of the
language, a clever combination of sentence embeddings is usually better than
encoding the full document as a single unit, even when this is possible. We
demonstrate that while a simple sentence average results in a strong baseline
for classification tasks, more complex combinations are necessary for semantic
tasks.
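The combination strategies named in the abstract (token truncation, sentence averaging, windowing) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `toy_sentence_encoder` is a hypothetical stand-in for a real multilingual encoder such as LASER, LaBSE, or Sentence BERT, and the window and stride values are arbitrary. It is not the authors' implementation.

```python
import numpy as np


def toy_sentence_encoder(sentence: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a real sentence encoder.

    Buckets characters into a fixed-size vector and L2-normalises it.
    """
    vec = np.zeros(dim)
    for i, ch in enumerate(sentence):
        vec[(ord(ch) + i) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def truncate_tokens(text: str, max_tokens: int) -> str:
    """Truncation baseline: keep only the first max_tokens whitespace tokens."""
    return " ".join(text.split()[:max_tokens])


def average_embedding(sentence_vecs: np.ndarray) -> np.ndarray:
    """Sentence averaging: the document vector is the mean of its sentence vectors."""
    return sentence_vecs.mean(axis=0)


def windowed_embeddings(sentence_vecs: np.ndarray, window: int = 2, stride: int = 1) -> np.ndarray:
    """Simple windowing (assumed variant): average each sliding window of sentences."""
    n = len(sentence_vecs)
    if n <= window:
        return sentence_vecs.mean(axis=0, keepdims=True)
    starts = range(0, n - window + 1, stride)
    return np.stack([sentence_vecs[s:s + window].mean(axis=0) for s in starts])


# Toy document, already split into sentences.
sentences = [
    "Dense vectors are crucial in NLP.",
    "Document embeddings are hard to obtain.",
    "Sentence embeddings can be combined instead.",
]
sentence_vecs = np.stack([toy_sentence_encoder(s) for s in sentences])

doc_vec = average_embedding(sentence_vecs)        # one vector per document
window_vecs = windowed_embeddings(sentence_vecs)  # one vector per window
truncated = truncate_tokens(" ".join(sentences), max_tokens=5)
```

With a real encoder, only `toy_sentence_encoder` would need to be replaced. The pooling functions operate on any array of shape (number of sentences, embedding dimension).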
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- A General-Purpose Multilingual Document Encoder [9.868221447090855]
We pretrain a massively multilingual document encoder as a hierarchical transformer model (HMDE).
We leverage Wikipedia as a readily available source of comparable documents for creating training data.
We evaluate the effectiveness of HMDE on two of arguably the most common and prominent cross-lingual document-level tasks.
arXiv Detail & Related papers (2023-05-11T17:55:45Z)
- Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval [87.11836738011007]
We propose a multilingual language model called masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
arXiv Detail & Related papers (2023-02-03T09:54:27Z)
- Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases [0.0]
We propose a method for training effective language-specific sentence encoders without manually labeled data.
Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora.
Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks.
arXiv Detail & Related papers (2022-07-26T09:08:56Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embeddings for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models [8.745407715423992]
Cross-lingual document representations enable language understanding in multilingual contexts.
Large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks.
arXiv Detail & Related papers (2021-06-07T07:14:00Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)