Repetition Improves Language Model Embeddings
- URL: http://arxiv.org/abs/2402.15449v2
- Date: Sun, 07 Sep 2025 18:50:16 GMT
- Title: Repetition Improves Language Model Embeddings
- Authors: Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan
- Abstract summary: "Echo embeddings" convert autoregressive language models into strong text embedding models without changing the architecture or requiring fine-tuning. Our zero-shot embeddings nearly match those obtained by bidirectionally-converted LMs that undergo additional masked-language modeling training.
- Score: 86.71985212601258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bidirectional models are considered essential for strong text embeddings. Recent approaches that adapt autoregressive language models (LMs) into strong text embedding models have largely required modifying the LM architecture to be bidirectional. We challenge this premise by introducing "echo embeddings," which convert autoregressive LMs into high-quality text embedding models without changing the architecture or requiring fine-tuning. By repeating the input and extracting embeddings from the repeated tokens -- which have access to all original tokens -- echo embeddings improve over classical LM embeddings by over 5% in zero-shot settings. Our zero-shot embeddings nearly match those obtained by bidirectionally-converted LMs that undergo additional masked-language modeling training. Echo embeddings are also compatible with supervised fine-tuning, matching or outperforming bidirectionally-converted LMs in an apples-to-apples comparison, even with an identical compute budget during training and inference. Overall, repetition is a simple and effective strategy to circumvent the need for bidirectional attention in embedding models, paving the way towards a unified architecture for all NLP tasks.
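The recipe in the abstract is concrete enough to sketch: repeat the input, then pool hidden states only over the second occurrence, whose tokens can attend to the entire original text. The snippet below is a minimal illustration using Hugging Face transformers; the prompt template, the mean-pooling choice, and the model name are assumptions made for the sketch, not the paper's exact configuration.

```python
# Minimal sketch of echo embeddings with a decoder-only LM.
# Assumptions: the prompt template, mean pooling over the repeated span,
# and the model choice are illustrative, not the paper's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM works in principle
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def echo_embed(text: str) -> torch.Tensor:
    # Repeat the input so the second occurrence can attend to the full first occurrence.
    prefix = f"Rewrite the passage: {text}\nRewritten passage: "
    prompt = prefix + text
    # Offset of the repeated span (approximate: assumes the prefix tokenization
    # is a prefix of the full prompt tokenization).
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
    # Pool only over the repeated occurrence of the text.
    return hidden[0, prefix_len:].mean(dim=0)

vec = echo_embed("Repetition improves language model embeddings.")
```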
Related papers
- Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization [35.43577499735611]
We introduce a Parallel Decoupling Framework (PDF) for multimodal embedding learning. PDF conditions a shared MLLM backbone on distinct, learnable prefixes to roll out multiple parallel paths for one input. We instantiate PDF on multiple MLLM backbones and demonstrate its effectiveness on the MMEB benchmark.
arXiv Detail & Related papers (2025-11-03T13:57:08Z) - Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models [3.8688081072587326]
Causal2Vec is a general-purpose embedding model tailored to enhance the performance of decoder-only large language models. We first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token. To mitigate the recency bias of last-token pooling, we use the last hidden states of the Contextual and EOS tokens as the final text embedding.
arXiv Detail & Related papers (2025-07-31T10:01:11Z) - Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves an overall score of 0.77 on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Modular Prompt Learning Improves Vision-Language Models [49.132774679968456]
We propose Modular Prompt Learning (MPL) to promote the preservation of information contained in the inserted prompts. On average, our method achieves a 0.7% performance gain on the base-to-new generalization task. The largest improvement on an individual dataset is 10.7%.
arXiv Detail & Related papers (2025-02-19T22:00:20Z) - MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems [8.971049629873185]
MTLM is a novel training paradigm that unifies unidirectional and bidirectional modeling through three training objectives. It supports multiple decoding strategies, including shallow fusion and unidirectional/bidirectional n-best rescoring. Experiments on the LibriSpeech dataset show that MTLM consistently outperforms unidirectional training across multiple decoding strategies.
arXiv Detail & Related papers (2025-02-14T10:21:10Z) - Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies [7.14946066475415]
Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. Existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters. We present three new SD methods that remove this shared-vocabulary constraint. Our algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding.
arXiv Detail & Related papers (2025-01-31T19:13:58Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting [0.0]
This paper proposes a novel approach to improve embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding process.
The effectiveness of this approach is evaluated on three datasets: Banking77Classification, TwitterSemEval2015, and AmazonCounterfactualClassification.
arXiv Detail & Related papers (2024-04-18T15:58:56Z) - Generative Representational Instruction Tuning [89.76840377003178]
GritLM 7B sets a new state of the art on the Massive Text Embedding Benchmark (MTEB)
GritLM 8x7B outperforms all open generative language models that we tried while still being among the best embedding models.
arXiv Detail & Related papers (2024-02-15T12:12:19Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as input the sentences of the textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Incomplete Utterance Rewriting as Sequential Greedy Tagging [0.0]
We introduce speaker-aware embedding to model speaker variation.
Our model achieves the best results on all nine restoration scores, with the remaining metrics comparable to previous state-of-the-art models.
arXiv Detail & Related papers (2023-07-08T04:05:04Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embeddings using cross-modal information retrieved from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - Enhancing Black-Box Few-Shot Text Classification with Prompt-Based Data Augmentation [42.05617728412819]
We show how to optimize few-shot text classification without accessing the gradients of the large-scale language models.
Our approach, dubbed BT-Classifier, significantly outperforms state-of-the-art black-box few-shot learners.
arXiv Detail & Related papers (2023-05-23T07:54:34Z) - ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval [22.882301169283323]
We propose a retrieval-enhanced framework to create training data from a general-domain unlabeled corpus.
Experiments on nine datasets demonstrate that ReGen achieves a 4.3% gain over the strongest baselines and saves around 70% of the time compared to baselines using large NLG models.
arXiv Detail & Related papers (2023-05-18T04:30:09Z) - RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models [3.4523793651427113]
We propose the duplex masked auto-encoder, a.k.a. DupMAE, which aims to improve the semantic representation capacity of the contextualized embeddings of both [CLS] and ordinary tokens.
DupMAE is simple but empirically competitive: with a small decoding cost, it substantially contributes to the model's representation capability and transferability.
arXiv Detail & Related papers (2022-11-16T08:57:55Z) - DoubleMix: Simple Interpolation-Based Data Augmentation for Text
Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training example.
It then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z) - Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models
Robust with Little Cost [5.672132510411465]
State-of-the-art NLP systems represent inputs with word embeddings, but these are brittle when faced with Out-of-Vocabulary words.
We follow the principle of mimick-like models to generate vectors for unseen words, by learning the behavior of pre-trained embeddings using only the surface form of words.
We present a simple contrastive learning framework, LOVE, which extends the word representation of an existing pre-trained language model (such as BERT) and makes it robust to OOV with few additional parameters.
arXiv Detail & Related papers (2022-03-15T13:11:07Z) - Source and Target Bidirectional Knowledge Distillation for End-to-end
Speech Translation [88.78138830698173]
We focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models.
We train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder.
arXiv Detail & Related papers (2021-04-13T19:00:51Z) - Improve Variational Autoencoder for Text Generationwith Discrete Latent
Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
With a strong auto-regressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z) - AMR Parsing via Graph-Sequence Iterative Inference [62.85003739964878]
We propose a new end-to-end model that treats AMR parsing as a series of dual decisions on the input sequence and the incrementally constructed graph.
We show that the answers to these two questions are mutually dependent.
We design a model based on iterative inference that helps achieve better answers in both perspectives, leading to greatly improved parsing accuracy.
arXiv Detail & Related papers (2020-04-12T09:15:21Z) - LAVA NAT: A Non-Autoregressive Translation Model with Look-Around
Decoding and Vocabulary Attention [54.18121922040521]
Non-autoregressive translation (NAT) models generate multiple tokens in one forward pass.
These NAT models often suffer from the multimodality problem, generating duplicated tokens or missing tokens.
We propose two novel methods to address this issue, the Look-Around (LA) strategy and the Vocabulary Attention (VA) mechanism.
arXiv Detail & Related papers (2020-02-08T04:11:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.