Towards General Text Embeddings with Multi-stage Contrastive Learning
- URL: http://arxiv.org/abs/2308.03281v1
- Date: Mon, 7 Aug 2023 03:52:59 GMT
- Title: Towards General Text Embeddings with Multi-stage Contrastive Learning
- Authors: Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang
- Abstract summary: GTE is a general-purpose text embedding model trained with multi-stage contrastive learning.
We train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources.
- Score: 20.803769345818456
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present GTE, a general-purpose text embedding model trained with
multi-stage contrastive learning. In line with recent advancements in unifying
various NLP tasks into a single format, we train a unified text embedding model
by employing contrastive learning over a diverse mixture of datasets from
multiple sources. By significantly increasing the number of training data
during both unsupervised pre-training and supervised fine-tuning stages, we
achieve substantial performance gains over existing embedding models. Notably,
even with a relatively modest parameter count of 110M, GTE$_\text{base}$
outperforms the black-box embedding API provided by OpenAI and even surpasses
10x larger text embedding models on the massive text embedding benchmark.
Furthermore, without additional fine-tuning on each programming language
individually, our model outperforms previous best code retrievers of similar
size by treating code as text. In summary, our model achieves impressive
results by effectively harnessing multi-stage contrastive learning, offering a
powerful and efficient text embedding model with broad applicability across
various NLP and code-related tasks.
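The multi-stage recipe above ultimately optimizes a contrastive objective over (query, passage) pairs, first on weakly supervised web pairs and then on supervised data. Below is a minimal sketch of such an in-batch-negative InfoNCE loss in PyTorch; the encoder checkpoint, mean pooling, and temperature value are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of contrastive training with in-batch negatives for a text
# embedder. The checkpoint, pooling, and temperature are illustrative
# assumptions, not GTE's exact configuration.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # ~110M parameters, comparable in size to GTE-base
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)


def embed(texts):
    """Mean-pool the last hidden states into one L2-normalized vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state              # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, L, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, H)
    return F.normalize(pooled, dim=-1)


def info_nce_loss(queries, passages, temperature=0.05):
    """Each query's positive is its paired passage; every other passage in the
    batch serves as a negative."""
    q, p = embed(queries), embed(passages)
    scores = q @ p.T / temperature                # (B, B) similarity matrix
    labels = torch.arange(q.size(0))              # positives lie on the diagonal
    return F.cross_entropy(scores, labels)


loss = info_nce_loss(
    ["how does self-attention work?", "capital of france"],
    ["Self-attention relates every token to every other token.",
     "Paris is the capital of France."],
)
loss.backward()
```

In a multi-stage setup of the kind the abstract describes, the same objective would simply be applied twice: first over the large weakly supervised mixture, then over the smaller supervised datasets.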
Related papers
- Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging [33.23758947497205]
Advanced embedding models are typically developed using large-scale multi-task data and joint training across multiple tasks.
To overcome these challenges, we explore model merging, a technique that combines independently trained models to mitigate gradient conflicts and balance data distribution.
We introduce a novel method, Self Positioning, which efficiently searches for optimal model combinations within the space of task vectors using gradient descent (see the task-vector sketch after this list).
arXiv Detail & Related papers (2024-10-19T08:39:21Z)
- Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z)
- Multi-Task Learning for Front-End Text Processing in TTS [15.62497569424995]
We propose a multi-task learning (MTL) model for jointly performing three tasks that are commonly solved in a text-to-speech front-end.
Our framework utilizes a tree-like structure with a trunk that learns shared representations, followed by separate task-specific heads (see the trunk-and-heads sketch after this list).
arXiv Detail & Related papers (2024-01-12T02:13:21Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- A Simple yet Efficient Ensemble Approach for AI-generated Text Detection [0.5840089113969194]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing.
It is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text.
We propose a simple yet efficient solution by ensembling predictions from multiple constituent LLMs (see the ensembling sketch after this list).
arXiv Detail & Related papers (2023-11-06T13:11:02Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), which works with data from different tasks.
UMLNMT achieves substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- JOIST: A Joint Speech and Text Streaming Model For ASR [63.15848310748753]
We present JOIST, an algorithm to train a streaming, cascaded-encoder end-to-end (E2E) model with both paired speech-text inputs and unpaired text-only inputs.
We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text.
arXiv Detail & Related papers (2022-10-13T20:59:22Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Multiple Word Embeddings for Increased Diversity of Representation [15.279850826041066]
We show a technique that substantially and consistently improves performance over a strong baseline with a negligible increase in run time.
We analyze aspects of pre-trained embedding similarity and vocabulary coverage and find that representational diversity is the driving force behind why this technique works.
arXiv Detail & Related papers (2020-09-30T02:33:09Z)
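For the model-merging entry above, here is a minimal sketch of the general task-vector idea it builds on: a task vector is a fine-tuned model's weights minus the base model's weights, the merged model is the base plus a weighted sum of task vectors, and the mixing weights are tuned by gradient descent on a small validation objective. The function names and the validation hook are illustrative assumptions; the cited paper's Self Positioning procedure is not reproduced here.

```python
# Sketch of task-vector model merging with mixing weights tuned by gradient
# descent. Illustrative only; not the cited paper's exact procedure.
import torch


def task_vectors(base_state, finetuned_states):
    """Task vector = fine-tuned weights minus base weights, per parameter."""
    return [{k: ft[k] - base_state[k] for k in base_state} for ft in finetuned_states]


def merge(base_state, vectors, weights):
    """Merged weights = base + sum_i weights[i] * task_vector_i."""
    return {
        k: base_state[k] + sum(w * v[k] for w, v in zip(weights, vectors))
        for k in base_state
    }


def search_weights(base_state, vectors, validation_loss, steps=100, lr=1e-2):
    """Tune the mixing weights by gradient descent on a validation objective.
    `validation_loss` is a caller-supplied (hypothetical) hook that maps a merged
    state_dict to a differentiable scalar loss, e.g. via torch.func.functional_call."""
    weights = torch.nn.Parameter(torch.full((len(vectors),), 1.0 / len(vectors)))
    opt = torch.optim.Adam([weights], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = validation_loss(merge(base_state, vectors, weights))
        loss.backward()
        opt.step()
    return weights.detach()
```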
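For the TTS front-end entry, a minimal PyTorch sketch of the shared-trunk-plus-task-specific-heads layout it describes; the layer types, sizes, and task names are illustrative assumptions.

```python
# Sketch of a multi-task model: one shared trunk, one head per task.
# Layer sizes and task names are illustrative assumptions.
import torch
import torch.nn as nn


class MultiTaskFrontEnd(nn.Module):
    def __init__(self, vocab_size=10_000, hidden=256, num_tags=32):
        super().__init__()
        # Trunk: shared representation used by every task.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Separate task-specific heads (names are placeholders).
        self.heads = nn.ModuleDict({
            "text_normalization": nn.Linear(hidden, num_tags),
            "prosody": nn.Linear(hidden, num_tags),
            "g2p": nn.Linear(hidden, num_tags),
        })

    def forward(self, token_ids):
        shared, _ = self.lstm(self.embed(token_ids))   # (B, L, hidden) trunk output
        return {name: head(shared) for name, head in self.heads.items()}


model = MultiTaskFrontEnd()
outputs = model(torch.randint(0, 10_000, (2, 16)))  # two sequences of 16 tokens
```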
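For the AI-generated-text detection entry, a minimal sketch of ensembling per-text scores from several constituent detectors by simple averaging; the detectors and the combination rule are illustrative assumptions, not necessarily the cited paper's exact scheme.

```python
# Sketch of ensembling detection scores from several constituent models by
# averaging. Detectors and thresholds here are toy placeholders.
from statistics import mean
from typing import Callable, List

Detector = Callable[[str], float]  # returns P(text is machine-generated)


def ensemble_score(text: str, detectors: List[Detector]) -> float:
    """Average the probability estimates of all constituent detectors."""
    return mean(det(text) for det in detectors)


def is_ai_generated(text: str, detectors: List[Detector], threshold: float = 0.5) -> bool:
    return ensemble_score(text, detectors) >= threshold


# Toy stand-ins for real detector models (purely illustrative).
detectors = [lambda t: 0.8, lambda t: 0.6, lambda t: 0.4]
print(is_ai_generated("Some candidate passage.", detectors))  # True (mean = 0.6)
```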
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.