Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search
- URL: http://arxiv.org/abs/2502.12974v1
- Date: Tue, 18 Feb 2025 15:56:34 GMT
- Title: Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search
- Authors: Yifan Ji, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shi Yu, Yishan Li, Zhiyuan Liu, Yu Gu, Ge Yu, Maosong Sun,
- Abstract summary: Deliberate Thinking based Dense Retriever (DEBATER)<n>DEBATER enhances recent dense retrievers by enabling them to learn more effective document representations through a step-by-step thinking process.<n> Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks.
- Score: 65.53881294642451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent dense retrievers usually thrive on the emergency capabilities of Large Language Models (LLMs), using them to encode queries and documents into an embedding space for retrieval. These LLM-based dense retrievers have shown promising performance across various retrieval scenarios. However, relying on a single embedding to represent documents proves less effective in capturing different perspectives of documents for matching. In this paper, we propose Deliberate Thinking based Dense Retriever (DEBATER), which enhances these LLM-based retrievers by enabling them to learn more effective document representations through a step-by-step thinking process. DEBATER introduces the Chain-of-Deliberation mechanism to iteratively optimize document representations using a continuous chain of thought. To consolidate information from various thinking steps, DEBATER also incorporates the Self Distillation mechanism, which identifies the most informative thinking steps and integrates them into a unified text embedding. Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks, demonstrating superior accuracy and robustness. All codes are available at https://github.com/OpenBMB/DEBATER.
Related papers
- Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding.<n>We propose SLEUTH, a multi agent framework that orchestrates a retriever and four collaborative agents in a coarse to fine process.<n>The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z) - DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval [36.38599923075882]
DIVER is a retrieval pipeline designed for reasoning-intensive information retrieval.<n>It consists of four components: the document preprocessing stage, the query expansion stage, the retrieval stage and the reranking stage.<n>On the BRIGHT benchmark, DIVER achieves state-of-the-art nDCG@10 scores of 45.8 overall and 28.9 on original queries, consistently outperforming competitive reasoning-aware models.
arXiv Detail & Related papers (2025-08-11T13:57:49Z) - A Unified Retrieval Framework with Document Ranking and EDU Filtering for Multi-document Summarization [18.13855430873805]
Current methods apply truncation after the retrieval process to fit the context length.
We propose a novel retrieval-based framework that integrates query selection and document ranking.
We evaluate our framework on multiple MDS datasets, demonstrating consistent improvements in ROUGE metrics.
arXiv Detail & Related papers (2025-04-23T13:41:10Z) - Enhanced Retrieval of Long Documents: Leveraging Fine-Grained Block Representations with Large Language Models [24.02950598944251]
We introduce a novel, fine-grained approach aimed at enhancing the accuracy of relevance scoring for long documents.<n>Our methodology firstly segments a long document into blocks, each of which is embedded using an LLM.<n>We aggregate the query-block relevance scores through a weighted sum method, yielding a comprehensive score for the query with the entire document.
arXiv Detail & Related papers (2025-01-28T16:03:52Z) - PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z) - Lexically-Accelerated Dense Retrieval [29.327878974130055]
'LADR' (Lexically-Accelerated Dense Retrieval) is a simple-yet-effective approach that improves the efficiency of existing dense retrieval models.
LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
arXiv Detail & Related papers (2023-07-31T15:44:26Z) - Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo- documents by few-shot prompting large language models (LLMs)
query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets.
Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
arXiv Detail & Related papers (2023-03-14T07:27:30Z) - Fine-Grained Distillation for Long Document Retrieval [86.39802110609062]
Long document retrieval aims to fetch query-relevant documents from a large-scale collection.
Knowledge distillation has become de facto to improve a retriever by mimicking a heterogeneous yet powerful cross-encoder.
We propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers.
arXiv Detail & Related papers (2022-12-20T17:00:36Z) - Natural Logic-guided Autoregressive Multi-hop Document Retrieval for
Fact Verification [21.04611844009438]
We propose a novel retrieve-and-rerank method for multi-hop retrieval.
It consists of a retriever that jointly scores documents in the knowledge source and sentences from previously retrieved documents.
It is guided by a proof system that dynamically terminates the retrieval process if the evidence is deemed sufficient.
arXiv Detail & Related papers (2022-12-10T11:32:38Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - Autoregressive Search Engines: Generating Substrings as Document
Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - CODER: An efficient framework for improving retrieval through
COntextualized Document Embedding Reranking [11.635294568328625]
We present a framework for improving the performance of a wide class of retrieval models at minimal computational cost.
It utilizes precomputed document representations extracted by a base dense retrieval method.
It incurs a negligible computational overhead on top of any first-stage method at run time, allowing it to be easily combined with any state-of-the-art dense retrieval method.
arXiv Detail & Related papers (2021-12-16T10:25:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.