Training LLMs to be Better Text Embedders through Bidirectional Reconstruction
- URL: http://arxiv.org/abs/2509.03020v4
- Date: Thu, 09 Oct 2025 07:37:52 GMT
- Title: Training LLMs to be Better Text Embedders through Bidirectional Reconstruction
- Authors: Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin
- Abstract summary: We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query).
- Score: 37.53732954585151
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
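To make the abstract's two ideas concrete, here is a minimal pure-Python sketch, not the authors' implementation. `last_token_pool` shows the common practice of taking the hidden state of the final non-padding token (for example [EOS]) as the text embedding. `interleave_tasks` shows one possible way to alternate the EBQ2D and EBD2Q tasks over query-document pairs. Both function names, the nested-list data layout, and the strict even/odd alternation are assumptions made for illustration; the abstract only says the two tasks "interleave".

```python
from typing import List, Tuple


def last_token_pool(
    hidden_states: List[List[List[float]]],
    attention_mask: List[List[int]],
) -> List[List[float]]:
    """Return, per sequence, the hidden state of the last non-padding token.

    hidden_states: [batch][seq_len][dim]; attention_mask: [batch][seq_len],
    with 1 for real tokens and 0 for padding (hypothetical layout).
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Index of the last position marked as a real token; this works for
        # both left- and right-padded sequences. Raises ValueError if a
        # sequence contains no real tokens.
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last])
    return pooled


def interleave_tasks(
    pairs: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Alternate reconstruction tasks over (query, document) pairs.

    Each result is (task, text_to_embed, text_to_reconstruct).
    The even/odd schedule is an assumption, not taken from the paper.
    """
    examples = []
    for i, (query, document) in enumerate(pairs):
        if i % 2 == 0:
            # EBQ2D: embed the query, reconstruct the document from it.
            examples.append(("EBQ2D", query, document))
        else:
            # EBD2Q: embed the document, reconstruct the query from it.
            examples.append(("EBD2Q", document, query))
    return examples
```

In a real training loop, the pooled embedding of the first text would presumably condition a decoder that is trained to generate the second text; that part depends on details the abstract does not give and is left out here.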
Related papers
- Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models [3.8688081072587326]
Causal2Vec is a general-purpose embedding model tailored to enhance the performance of decoder-only large language models. We first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token. To mitigate the recency bias introduced by last-token pooling, we concatenate the last hidden states of the Contextual and EOS tokens as the final text embedding.
arXiv Detail & Related papers (2025-07-31T10:01:11Z)
- GEM: Empowering LLM for both Embedding Generation and Language Understanding [11.081595808236239]
We propose Generative Embedding large language Model (GEM) to generate high-quality text embeddings. Our method inserts new special token(s) into a text body, and generates summarization embedding of the text by manipulating the attention mask. Our results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.
arXiv Detail & Related papers (2025-06-04T18:02:07Z)
- MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval [50.062817677022586]
Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens. We propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI) to construct two complementary training tasks using only unlabeled images.
arXiv Detail & Related papers (2025-05-26T08:56:59Z)
- Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective [40.29094043868067]
We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
arXiv Detail & Related papers (2025-05-21T02:59:14Z)
- Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling [69.84963245729826]
We propose an auxiliary query likelihood (QL) task to enhance the backbone for subsequent contrastive learning of the retriever. We introduce our model, which incorporates two key components: Attention Block (AB) and Document Corruption (DC).
arXiv Detail & Related papers (2025-04-07T16:03:59Z)
- Enhancing Lexicon-Based Text Embeddings with Large Language Models [19.91595650613768]
Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. LENS consolidates the vocabulary space through token embedding clustering, and investigates bidirectional attention and various pooling strategies. LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB).
arXiv Detail & Related papers (2025-01-16T18:57:20Z)
- ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs.
We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z)
- Making Large Language Models A Better Foundation For Dense Retrieval [19.38740248464456]
Dense retrieval needs to learn discriminative text embeddings to represent the semantic relationship between query and document.
It may benefit from the use of large language models (LLMs), given LLMs' strong capability for semantic understanding.
We propose LLaRA (LLM adapted for dense RetrievAl), which works as a post-hoc adaptation of LLMs for dense retrieval applications.
arXiv Detail & Related papers (2023-12-24T15:10:35Z)
- Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning [57.74233319453229]
Large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task.
We propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus.
Our experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results.
arXiv Detail & Related papers (2023-10-17T03:21:43Z)
- ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction [56.790794611002106]
Large language models (LLMs) have demonstrated remarkable results in various natural language processing (NLP) tasks with in-context learning.
We propose a simple but effective in-context learning framework called ICL-D3IE.
Specifically, we extract the most difficult and distinct segments from hard training documents as hard demonstrations.
arXiv Detail & Related papers (2023-03-09T06:24:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.