SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
- URL: http://arxiv.org/abs/2508.01959v1
- Date: Sun, 03 Aug 2025 23:59:31 GMT
- Title: SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
- Authors: Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu,
- Abstract summary: We show how to represent short chunks in a way that is conditioned on a broader context window to enhance retrieval performance.<n>Existing embedding models are not well-equipped to encode such situated context effectively.<n>Our method substantially outperforms state-of-the-art embedding models.
- Score: 77.93156509994994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
Related papers
- Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings [25.966475857117175]
In this work, we introduce ConTEB, a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context.<n>Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required.<n>We propose InSeNT, a novel contrastive post-training approach which combined with late chunking pooling enhances contextual representation learning.
arXiv Detail & Related papers (2025-05-30T16:43:28Z) - Dewey Long Context Embedding Model: A Technical Report [0.0]
dewey_en_beta is a novel text embedding model that achieves excellent performance on MTEB (Eng, v2) and LongEmbed benchmark.<n>This report presents the training methodology and evaluation results of the open-source dewey_en_beta embedding model.
arXiv Detail & Related papers (2025-03-26T09:55:00Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.<n>We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.<n>We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Two are better than one: Context window extension with multi-grained self-injection [111.1376461868317]
SharedLLM is a novel approach grounded in the design philosophy of multi-grained context compression and query-aware information retrieval.
We introduce a specialized tree-style data structure to efficiently encode, store and retrieve multi-grained contextual information for text chunks.
arXiv Detail & Related papers (2024-10-25T06:08:59Z) - Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models [5.330795983408874]
We introduce a novel method called late chunking, which leverages long context embedding models to first embed all tokens of the long text.<n>The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks.
arXiv Detail & Related papers (2024-09-07T03:54:46Z) - KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs)
This work provides a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z) - Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT [48.35407228760352]
Retrieval pipelines perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across the entire text.
We develop long-context retrieval encoders suitable for these domains.
We first introduce LoCoV1, a novel 12 task benchmark constructed to measure long-context retrieval where chunking is not possible or not effective.
We next present the M2-BERT retrieval encoder, an 80M parameter state-space encoder model built from the Monarch Mixer architecture, capable of scaling to documents up to 32K tokens long.
arXiv Detail & Related papers (2024-02-12T06:43:52Z) - Dynamic Retrieval-Augmented Generation [4.741884506444161]
We propose a novel approach for the Dynamic Retrieval-Augmented Generation (DRAG)
DRAG injects compressed embeddings of the retrieved entities into the generative model.
Our approach achieves several targets: (1) lifting the length limitations of the context window, saving on the prompt size; (2) allowing huge expansion of the number of retrieval entities available for the context; (3) alleviating the problem of misspelling or failing to find relevant entity names.
arXiv Detail & Related papers (2023-12-14T14:26:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.