English Contrastive Learning Can Learn Universal Cross-lingual Sentence Embeddings
- URL: http://arxiv.org/abs/2211.06127v1
- Date: Fri, 11 Nov 2022 11:17:56 GMT
- Title: English Contrastive Learning Can Learn Universal Cross-lingual Sentence Embeddings
- Authors: Yau-Shian Wang and Ashley Wu and Graham Neubig
- Abstract summary: Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space.
In this work, we propose mSimCSE, which extends SimCSE to multilingual settings, and reveal that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data.
- Score: 77.94885131732119
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings, and reveal that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves on previous sentence embedding methods for cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to that of fully supervised methods in retrieving low-resource languages and on multilingual STS. Performance can be further enhanced when cross-lingual NLI data is available. Our code is publicly available at https://github.com/yaushian/mSimCSE.
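The abstract describes SimCSE-style contrastive training applied to a multilingual encoder using English data only. Below is a minimal, illustrative sketch of that recipe, assuming an XLM-R base encoder from Hugging Face Transformers, first-token pooling, and a toy English batch; the hyperparameters and pooling choice are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
encoder.train()  # keep dropout active so two passes yield two "views" of each sentence

def embed(sentences):
    # Encode a list of strings; take the first-token hidden state as the sentence vector.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return encoder(**inputs).last_hidden_state[:, 0]

def simcse_loss(sentences, temperature=0.05):
    # Unsupervised SimCSE: two forward passes with independent dropout masks
    # form positive pairs; the other sentences in the batch act as negatives.
    z1, z2 = embed(sentences), embed(sentences)
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0))
    return F.cross_entropy(sim, labels)

optimizer = torch.optim.AdamW(encoder.parameters(), lr=3e-5)
batch = ["A man is playing a guitar.", "Two dogs are running on a beach."]  # English only
loss = simcse_loss(batch)
loss.backward()
optimizer.step()
```

Because the encoder itself is multilingual, this English-only contrastive objective can align other languages in the shared space; cross-lingual retrieval or STS evaluation would reuse the same embedding function for non-English inputs.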
Related papers
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Improving Multi-lingual Alignment Through Soft Contrastive Learning [9.454626745893798]
We propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model.
Given translation sentence pairs, we train a multi-lingual model such that the similarity between cross-lingual embeddings follows the similarity of sentences measured by the mono-lingual teacher model.
arXiv Detail & Related papers (2024-05-25T09:46:07Z)
- Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity [2.422759879602353]
Cross-lingual transfer of Wikipedia data exhibits improved performance for monolingual STS.
We find that the Wikipedia domain is superior to the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data.
arXiv Detail & Related papers (2024-03-08T12:28:15Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel and large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- EASE: Entity-Aware Contrastive Learning of Sentence Embedding [37.7055989762122]
EASE is a novel method for learning sentence embeddings via contrastive learning between sentences and their related entities.
We show that EASE exhibits competitive or better performance in English semantic textual similarity (STS) and short text clustering (STC) tasks.
arXiv Detail & Related papers (2022-05-09T13:22:44Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Syntax-augmented Multilingual BERT for Cross-lingual Transfer [37.99210035238424]
This work shows that explicitly providing language syntax when training mBERT helps cross-lingual transfer.
Experimental results show that syntax-augmented mBERT improves cross-lingual transfer on popular benchmarks.
arXiv Detail & Related papers (2021-06-03T21:12:50Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids degenerating to predicting masked words conditioned only on the context of the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
A computationally cheap but effective approach can improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.