Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings
- URL: http://arxiv.org/abs/2402.17016v1
- Date: Mon, 26 Feb 2024 20:53:12 GMT
- Title: Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings
- Authors: Isabelle Mohr, Markus Krimmel, Saba Sturua, Mohammad Kalim Akram,
Andreas Koukounas, Michael G\"unther, Georgios Mastrapas, Vinit Ravishankar,
Joan Fontanals Mart\'inez, Feng Wang, Qi Liu, Ziniu Yu, Jie Fu, Saahil
Ognawala, Susana Guzman, Bo Wang, Maximilian Werk, Nan Wang, Han Xiao
- Abstract summary: We introduce a novel suite of state-of-the-art bilingual text embedding models.
These models are capable of processing lengthy text inputs with up to 8192 tokens.
We have significantly improved the model performance on STS tasks.
We have expanded the Massive Text Embedding Benchmark to include benchmarks for German and Spanish embedding models.
- Score: 22.71166607645311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel suite of state-of-the-art bilingual text embedding
models that are designed to support English and another target language. These
models are capable of processing lengthy text inputs with up to 8192 tokens,
making them highly versatile for a range of natural language processing tasks
such as text retrieval, clustering, and semantic textual similarity (STS)
calculations.
By focusing on bilingual models and introducing a unique multi-task learning
objective, we have significantly improved the model performance on STS tasks,
which outperforms the capabilities of existing multilingual models in both
target language understanding and cross-lingual evaluation tasks. Moreover, our
bilingual models are more efficient, requiring fewer parameters and less memory
due to their smaller vocabulary needs. Furthermore, we have expanded the
Massive Text Embedding Benchmark (MTEB) to include benchmarks for German and
Spanish embedding models. This integration aims to stimulate further research
and advancement in text embedding technologies for these languages.
Related papers
- Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model [14.39119862985503]
We aim to create a multilingual ALT system with available datasets.
Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario.
We evaluate the performance of the multilingual model in comparison to its monolingual counterparts.
arXiv Detail & Related papers (2024-06-25T15:02:32Z) - Accelerating Multilingual Language Model for Excessively Tokenized Languages [3.5570874721859016]
tokenizers in large language models (LLMs) often fragment a text into character or Unicode-level tokens in non-Roman alphabetic languages.
We introduce a simple yet effective framework to accelerate text generation in such languages.
arXiv Detail & Related papers (2024-01-19T12:26:57Z) - Efficiently Aligned Cross-Lingual Transfer Learning for Conversational
Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z) - From Masked Language Modeling to Translation: Non-English Auxiliary
Tasks Improve Zero-shot Spoken Language Understanding [24.149299722716155]
We introduce xSID, a new benchmark for cross-lingual Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect.
We propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer.
Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.
arXiv Detail & Related papers (2021-05-15T23:51:11Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM)
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z) - Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.