Improving Text Embeddings with Large Language Models
- URL: http://arxiv.org/abs/2401.00368v3
- Date: Fri, 31 May 2024 07:22:01 GMT
- Title: Improving Text Embeddings with Large Language Models
- Authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei,
- Abstract summary: We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
- Score: 59.930513259982725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations.
In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark.
arXiv Detail & Related papers (2024-09-09T17:23:29Z) - FuseGen: PLM Fusion for Data-generation based Zero-shot Learning [18.51772808242954]
FuseGen is a novel data generation-based zero-shot learning framework.
It introduces a new criteria for subset selection from synthetic datasets.
The chosen subset provides in-context feedback to each PLM, enhancing dataset quality.
arXiv Detail & Related papers (2024-06-18T11:55:05Z) - RECOST: External Knowledge Guided Data-efficient Instruction Tuning [25.985023475991625]
We argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset.
We propose a framework dubbed as textbfRECOST, which integrates external-knowledge-base re-ranking and diversity-consistent sampling into a single pipeline.
arXiv Detail & Related papers (2024-02-27T09:47:36Z) - A Simple yet Efficient Ensemble Approach for AI-generated Text Detection [0.5840089113969194]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing.
It is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text.
We propose a simple yet efficient solution by ensembling predictions from multiple constituent LLMs.
arXiv Detail & Related papers (2023-11-06T13:11:02Z) - TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z) - Towards General Text Embeddings with Multi-stage Contrastive Learning [20.803769345818456]
GTE is a general-purpose text embedding model trained with multi-stage contrastive learning.
We train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources.
arXiv Detail & Related papers (2023-08-07T03:52:59Z) - Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.