Related papers: Improving Text Embeddings with Large Language Models

Improving Text Embeddings with Large Language Models

URL: http://arxiv.org/abs/2401.00368v3
Date: Fri, 31 May 2024 07:22:01 GMT
Title: Improving Text Embeddings with Large Language Models
Authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei,
Abstract summary: We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
Score: 59.930513259982725
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.

Related papers

Synthetic Text Generation for Training Large Language Models via Gradient Matching [27.74603049449281]
We propose the first theoretically rigorous approach for generating synthetic human-readable text. In doing so, the generated synthetic text can guarantee convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data.
arXiv Detail & Related papers (2025-02-24T19:49:15Z)
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators [61.82799141938912]
Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. We introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset.
arXiv Detail & Related papers (2025-02-10T12:30:25Z)
BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation [71.46236155101032]
We propose Base-Refine, a synthetic data generation method that combines the diversity of base models with the quality of instruct-tuned models. We show that fine-tuning with BARE-generated data achieves a 101% improvement over instruct-only data on GSM8K and a 18.4% improvement over SOTA methods on RAFT.
arXiv Detail & Related papers (2025-02-03T00:12:40Z)
READ: Reinforcement-based Adversarial Learning for Text Classification with Limited Labeled Data [7.152603583363887]
Pre-trained transformer models such as BERT have shown massive gains across many text classification tasks. This paper proposes a method that encapsulates reinforcement learning-based text generation and semi-supervised adversarial learning approaches. Our method READ, Reinforcement-based Adversarial learning, utilizes an unlabeled dataset to generate diverse synthetic text through reinforcement learning.
arXiv Detail & Related papers (2025-01-14T11:39:55Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations. In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark.
arXiv Detail & Related papers (2024-09-09T17:23:29Z)
FuseGen: PLM Fusion for Data-generation based Zero-shot Learning [18.51772808242954]
FuseGen is a novel data generation-based zero-shot learning framework. It introduces a new criteria for subset selection from synthetic datasets. The chosen subset provides in-context feedback to each PLM, enhancing dataset quality.
arXiv Detail & Related papers (2024-06-18T11:55:05Z)
RECOST: External Knowledge Guided Data-efficient Instruction Tuning [25.985023475991625]
We argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset. We propose a framework dubbed as textbfRECOST, which integrates external-knowledge-base re-ranking and diversity-consistent sampling into a single pipeline.
arXiv Detail & Related papers (2024-02-27T09:47:36Z)
A Simple yet Efficient Ensemble Approach for AI-generated Text Detection [0.5840089113969194]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing. It is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text. We propose a simple yet efficient solution by ensembling predictions from multiple constituent LLMs.
arXiv Detail & Related papers (2023-11-06T13:11:02Z)
TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets. We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances. A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
Towards General Text Embeddings with Multi-stage Contrastive Learning [20.803769345818456]
GTE is a general-purpose text embedding model trained with multi-stage contrastive learning. We train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources.
arXiv Detail & Related papers (2023-08-07T03:52:59Z)
Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields. Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)
SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation. Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.