Repetition Improves Language Model Embeddings
- URL: http://arxiv.org/abs/2402.15449v1
- Date: Fri, 23 Feb 2024 17:25:10 GMT
- Title: Repetition Improves Language Model Embeddings
- Authors: Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig,
Aditi Raghunathan
- Abstract summary: We propose "echo embeddings," in which we repeat the input twice in context and extract embeddings from the second occurrence.
On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned.
- Score: 68.92976440181387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent approaches to improving the extraction of text embeddings from
autoregressive large language models (LLMs) have largely focused on
improvements to data, backbone pretrained language models, or task
differentiation via instructions. In this work, we address an
architectural limitation of autoregressive models: token embeddings cannot
contain information from tokens that appear later in the input. To address this
limitation, we propose a simple approach, "echo embeddings," in which we repeat
the input twice in context and extract embeddings from the second occurrence.
We show that echo embeddings of early tokens can encode information about later
tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On
the MTEB leaderboard, echo embeddings improve over classical embeddings by over
9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a
Mistral-7B model achieve state-of-the-art performance compared to prior open-source models
that do not leverage synthetic fine-tuning data.
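Below is a minimal sketch of the echo-embedding recipe, assuming a Hugging Face transformers backbone; the prompt template, the choice of Mistral-7B, and last-layer mean pooling over the second (echoed) copy of the input are illustrative assumptions, not necessarily the authors' exact setup.

```python
# Sketch of echo embeddings: repeat the input in context and pool over the
# second occurrence. Prompt wording, pooling, and model choice are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # any autoregressive LM could be substituted

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def echo_embed(text: str) -> torch.Tensor:
    """Embed `text` by repeating it in context and pooling over the second occurrence."""
    prefix = f"Rewrite the passage.\nPassage: {text}\nRewritten passage: "
    prompt = prefix + text  # the input now appears twice in the context
    # Approximate token span of the echoed copy; tokenization at the boundary
    # may shift the span by a token, which is acceptable for a sketch.
    prefix_len = len(tokenizer(prefix, add_special_tokens=False).input_ids)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        last_hidden = model(**inputs).hidden_states[-1]  # (1, seq_len, hidden_dim)
    # Tokens in the second copy attend to the full first copy, so even early
    # tokens can encode information about later tokens, unlike a single pass.
    return last_hidden[0, prefix_len:].mean(dim=0)
```

A classical baseline for comparison would pool over a single copy of the text; the only change here is the repetition and the restriction of pooling to the echoed span. For example, `echo_embed("Echo embeddings repeat the input twice.")` returns a single vector of the model's hidden size.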
Related papers
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting [0.0]
This paper proposes a novel approach to improve embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding process.
The effectiveness of this approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and Amazon Counterfactual Classification.
arXiv Detail & Related papers (2024-04-18T15:58:56Z)
- Generative Representational Instruction Tuning [89.76840377003178]
GritLM 7B sets a new state of the art on the Massive Text Embedding Benchmark (MTEB).
GritLM 8x7B outperforms all open generative language models that we tried while still being among the best embedding models.
arXiv Detail & Related papers (2024-02-15T12:12:19Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Incomplete Utterance Rewriting as Sequential Greedy Tagging [0.0]
We introduce a speaker-aware embedding to model speaker variation.
Our model achieves the best results on all nine restoration scores, while its other metric scores are comparable to previous state-of-the-art models.
arXiv Detail & Related papers (2023-07-08T04:05:04Z)
- Enhancing Black-Box Few-Shot Text Classification with Prompt-Based Data Augmentation [42.05617728412819]
We show how to optimize few-shot text classification without accessing the gradients of the large-scale language models.
Our approach, dubbed BT-Classifier, significantly outperforms state-of-the-art black-box few-shot learners.
arXiv Detail & Related papers (2023-05-23T07:54:34Z)
- ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval [22.882301169283323]
We propose a retrieval-enhanced framework to create training data from a general-domain unlabeled corpus.
Experiments on nine datasets demonstrate that ReGen achieves a 4.3% gain over the strongest baselines and saves around 70% of the time compared to baselines using large NLG models.
arXiv Detail & Related papers (2023-05-18T04:30:09Z)
- DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training sample.
It then uses the perturbed data and the original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
- Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost [5.672132510411465]
State-of-the-art NLP systems represent inputs with word embeddings, but these are brittle when faced with Out-of-Vocabulary words.
We follow the principle of mimick-like models to generate vectors for unseen words, by learning the behavior of pre-trained embeddings using only the surface form of words.
We present a simple contrastive learning framework, LOVE, which extends the word representation of an existing pre-trained language model (such as BERT) and makes it robust to OOV with few additional parameters.
arXiv Detail & Related papers (2022-03-15T13:11:07Z)
- AMR Parsing via Graph-Sequence Iterative Inference [62.85003739964878]
We propose a new end-to-end model that treats AMR parsing as a series of dual decisions on the input sequence and the incrementally constructed graph.
We show that the answers to these two questions causally depend on each other.
We design a model based on iterative inference that helps achieve better answers in both perspectives, leading to greatly improved parsing accuracy.
arXiv Detail & Related papers (2020-04-12T09:15:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.