LAWDR: Language-Agnostic Weighted Document Representations from
Pre-trained Models
- URL: http://arxiv.org/abs/2106.03379v1
- Date: Mon, 7 Jun 2021 07:14:00 GMT
- Title: LAWDR: Language-Agnostic Weighted Document Representations from
Pre-trained Models
- Authors: Hongyu Gong, Vishrav Chaudhary, Yuqing Tang, Francisco Guzmán
- Abstract summary: Cross-lingual document representations enable language understanding in multilingual contexts.
Large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks.
- Score: 8.745407715423992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-lingual document representations enable language understanding in
multilingual contexts and allow transfer learning from high-resource to
low-resource languages at the document level. Recently large pre-trained
language models such as BERT, XLM and XLM-RoBERTa have achieved great success
when fine-tuned on sentence-level downstream tasks. It is tempting to apply
these cross-lingual models to document representation learning. However, there
are two challenges: (1) these models impose high costs on long document
processing and thus many of them have strict length limits; (2) model
fine-tuning requires extra data and computational resources, which is not
practical in resource-limited settings. In this work, we address these
challenges by proposing unsupervised Language-Agnostic Weighted Document
Representations (LAWDR). We study the geometry of pre-trained sentence
embeddings and leverage it to derive document representations without
fine-tuning. Evaluated on cross-lingual document alignment, LAWDR demonstrates
comparable performance to state-of-the-art models on benchmark datasets.
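The abstract describes deriving document representations from pre-trained sentence embeddings without fine-tuning. As a minimal sketch of one plausible scheme in that spirit (a weighted average of sentence embeddings followed by common-component removal, similar to SIF-style post-processing), not LAWDR's exact formulation:

```python
import numpy as np

def weighted_doc_embedding(sent_embs, weights=None):
    """Combine sentence embeddings into a single document vector.

    Illustrative sketch only: a weighted average of sentence embeddings,
    then removal of the leading principal component, which tends to carry
    corpus-wide (often language-specific) bias. The actual LAWDR weighting
    is derived from the geometry of the embeddings and differs from this.
    """
    sent_embs = np.asarray(sent_embs, dtype=float)  # shape (n_sents, dim)
    if weights is None:
        weights = np.ones(len(sent_embs))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()               # normalize to sum to 1
    doc = weights @ sent_embs                       # weighted average, (dim,)
    # First right-singular vector of the centered embeddings = leading
    # principal direction; project it out of the document vector.
    _, _, vt = np.linalg.svd(sent_embs - sent_embs.mean(0),
                             full_matrices=False)
    u = vt[0]
    return doc - (doc @ u) * u
```

Because no fine-tuning is involved, such a scheme only needs forward passes of a frozen sentence encoder, which matches the resource-limited setting the abstract targets.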
Related papers
- Legal Documents Drafting with Fine-Tuned Pre-Trained Large Language Model [1.3812010983144798]
This paper shows that we can leverage a large number of annotation-free legal documents without Chinese word segmentation to fine-tune a large-scale language model.
It can also accomplish the task of generating legal document drafts, while protecting information privacy and improving information security.
arXiv Detail & Related papers (2024-06-06T16:00:20Z) - One Law, Many Languages: Benchmarking Multilingual Legal Reasoning for Judicial Support [18.810320088441678]
This work introduces a novel NLP benchmark for the legal domain.
It challenges LLMs in five key dimensions: processing long documents (up to 50K tokens), using domain-specific knowledge (embodied in legal texts), and multilingual understanding (covering five languages).
Our benchmark contains diverse datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual legal system.
arXiv Detail & Related papers (2023-06-15T16:19:15Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Are the Best Multilingual Document Embeddings simply Based on Sentence
Embeddings? [18.968571816913208]
We provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models.
We show that a clever combination of sentence embeddings is usually better than encoding the full document as a single unit.
arXiv Detail & Related papers (2023-04-28T12:11:21Z) - Modeling Sequential Sentence Relation to Improve Cross-lingual Dense
Retrieval [87.11836738011007]
We propose a multilingual language model called masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
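The masked sentence prediction objective described above can be sketched as an InfoNCE-style contrastive loss: score the predicted vector for the masked sentence against the true sentence vector and sampled negatives. This is a hedged illustration of the general recipe; the paper's hierarchical contrastive loss has additional structure not reproduced here.

```python
import numpy as np

def masked_sentence_contrastive_loss(pred, target, negatives, tau=0.1):
    """Contrastive loss for predicting a masked sentence vector.

    pred:      document encoder's prediction for the masked slot, (dim,)
    target:    true sentence vector (the positive), (dim,)
    negatives: sampled negative sentence vectors, (k, dim)
    tau:       temperature scaling the cosine similarities

    Sketch of the standard InfoNCE form, not the exact MSM loss.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = cos(pred, target) / tau
    negs = np.array([cos(pred, n) / tau for n in negatives])
    logits = np.concatenate([[pos], negs])
    # Cross-entropy with the positive at index 0, computed via a
    # numerically stable log-sum-exp.
    return float(np.logaddexp.reduce(logits) - pos)
```

The loss is near zero when the prediction matches the target and is far from all negatives, and grows as negatives become competitive, which is the training signal the masked sentence prediction task relies on.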
arXiv Detail & Related papers (2023-02-03T09:54:27Z) - DOCmT5: Document-Level Pretraining of Multilingual Language Models [9.072507490639218]
We introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large scale parallel documents.
We propose a simple and effective pre-training objective - Document Reordering Machine Translation.
DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks.
arXiv Detail & Related papers (2021-12-16T08:58:52Z) - Improving the Lexical Ability of Pretrained Language Models for
Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align both the lexical-level and high-level representations of the two languages.
Previous research has shown that performance suffers when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - SPECTER: Document-level Representation Learning using Citation-informed
Transformers [51.048515757909215]
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.