Improving Text Embeddings with Large Language Models
- URL: http://arxiv.org/abs/2401.00368v3
- Date: Fri, 31 May 2024 07:22:01 GMT
- Title: Improving Text Embeddings with Large Language Models
- Authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei,
- Abstract summary: We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
- Score: 59.930513259982725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
Related papers
- FuseGen: PLM Fusion for Data-generation based Zero-shot Learning [18.51772808242954]
FuseGen is a novel data generation-based zero-shot learning framework.
It introduces a new criteria for subset selection from synthetic datasets.
The chosen subset provides in-context feedback to each PLM, enhancing dataset quality.
arXiv Detail & Related papers (2024-06-18T11:55:05Z) - DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation [20.703102374139537]
We propose a novel text dataset distillation approach called Distilling dataset into Language Model (DiLM)
DiLM trains a language model to generate informative synthetic training samples as text data, instead of directly optimizing synthetic samples.
Our code will be available at https://github.com/arumaekawa/DiLM.
arXiv Detail & Related papers (2024-03-30T06:40:54Z) - RECOST: External Knowledge Guided Data-efficient Instruction Tuning [25.985023475991625]
We argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset.
We propose a framework dubbed as textbfRECOST, which integrates external-knowledge-base re-ranking and diversity-consistent sampling into a single pipeline.
arXiv Detail & Related papers (2024-02-27T09:47:36Z) - Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z) - A Simple yet Efficient Ensemble Approach for AI-generated Text Detection [0.5840089113969194]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing.
It is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text.
We propose a simple yet efficient solution by ensembling predictions from multiple constituent LLMs.
arXiv Detail & Related papers (2023-11-06T13:11:02Z) - TarGEN: Targeted Data Generation with Large Language Models [54.1093098278564]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z) - Towards General Text Embeddings with Multi-stage Contrastive Learning [20.803769345818456]
GTE is a general-purpose text embedding model trained with multi-stage contrastive learning.
We train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources.
arXiv Detail & Related papers (2023-08-07T03:52:59Z) - Curriculum-Based Self-Training Makes Better Few-Shot Learners for
Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z) - Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.