Dual-Alignment Pre-training for Cross-lingual Sentence Embedding
- URL: http://arxiv.org/abs/2305.09148v1
- Date: Tue, 16 May 2023 03:53:30 GMT
- Title: Dual-Alignment Pre-training for Cross-lingual Sentence Embedding
- Authors: Ziheng Li, Shaohan Huang, Zihan Zhang, Zhi-Hong Deng, Qiang Lou,
Haizhen Huang, Jian Jiao, Furu Wei, Weiwei Deng, Qi Zhang
- Abstract summary: We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach can significantly improve sentence embedding.
- Score: 79.98111074307657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown that dual encoder models trained with the
sentence-level translation ranking task are effective methods for cross-lingual
sentence embedding. However, our research indicates that token-level alignment
is also crucial in multilingual scenarios, which has not been fully explored
previously. Based on our findings, we propose a dual-alignment pre-training
(DAP) framework for cross-lingual sentence embedding that incorporates both
sentence-level and token-level alignment. To achieve this, we introduce a novel
representation translation learning (RTL) task, where the model learns to use
one-side contextualized token representation to reconstruct its translation
counterpart. This reconstruction objective encourages the model to embed
translation information into the token representation. Compared to other
token-level alignment methods such as translation language modeling, RTL is
more suitable for dual encoder architectures and is computationally efficient.
Extensive experiments on three sentence-level cross-lingual benchmarks
demonstrate that our approach can significantly improve sentence embedding. Our
code is available at https://github.com/ChillingDream/DAP.
Related papers
- Sequence Shortening for Context-Aware Machine Translation [5.803309695504831]
We show that a special case of multi-encoder architecture achieves higher accuracy on contrastive datasets.
We introduce two novel methods - Latent Grouping and Latent Selecting, where the network learns to group tokens or selects the tokens to be cached as context.
arXiv Detail & Related papers (2024-02-02T13:55:37Z) - Translate-Distill: Learning Cross-Language Dense Retrieval by
Translation and Distillation [17.211592060717713]
This paper proposes Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model.
This richer design space enables the teacher model to perform inference in an optimized setting, while training the student model directly for CLIR.
arXiv Detail & Related papers (2024-01-09T20:40:49Z) - VECO 2.0: Cross-lingual Language Model Pre-training with
Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs.
token-to-token alignment is integrated to bridge the gap between synonymous tokens excavated via the thesaurus dictionary from the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z) - Lightweight Cross-Lingual Sentence Representation Learning [57.9365829513914]
We introduce a lightweight dual-transformer architecture with just 2 layers for generating memory-efficient cross-lingual sentence representations.
We propose a novel cross-lingual language model, which combines the existing single-word masked language model with the newly proposed cross-lingual token-level reconstruction task.
arXiv Detail & Related papers (2021-05-28T14:10:48Z) - SLM: Learning a Discourse Language Representation with Sentence
Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Context-aware Decoder for Neural Machine Translation using a Target-side
Document-Level Language Model [12.543106304662059]
We present a method to turn a sentence-level translation model into a context-aware model by incorporating a document-level language model into the decoder.
Our decoder is built upon only a sentence-level parallel corpora and monolingual corpora.
In a theoretical viewpoint, the core part of this work is the novel representation of contextual information using point-wise mutual information between context and the current sentence.
arXiv Detail & Related papers (2020-10-24T08:06:18Z) - InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language
Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.