Related papers: Fine-Grained Distillation for Long Document Retrieval

Fine-Grained Distillation for Long Document Retrieval

URL: http://arxiv.org/abs/2212.10423v1
Date: Tue, 20 Dec 2022 17:00:36 GMT
Title: Fine-Grained Distillation for Long Document Retrieval
Authors: Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Guodong Long, Can Xu, Daxin Jiang
Abstract summary: Long document retrieval aims to fetch query-relevant documents from a large-scale collection. Knowledge distillation has become de facto to improve a retriever by mimicking a heterogeneous yet powerful cross-encoder. We propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers.
Score: 86.39802110609062
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become de facto to improve a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval on long documents suffers from the scope hypothesis that a long document may cover multiple topics. This maximizes their structure heterogeneity and poses a granular-mismatch issue, leading to an inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces global-consistent representations crossing different fine granularity and then applies multi-granular aligned distillation merely during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, which show state-of-the-art performance.

Related papers

Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence [56.09494651178128]
Retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG) We show that retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. We show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs.
arXiv Detail & Related papers (2025-03-06T23:23:13Z)
Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search [65.53881294642451]
Deliberate Thinking based Dense Retriever (DEBATER) DEBATER enhances recent dense retrievers by enabling them to learn more effective document representations through a step-by-step thinking process. Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
Enhanced Retrieval of Long Documents: Leveraging Fine-Grained Block Representations with Large Language Models [24.02950598944251]
We introduce a novel, fine-grained approach aimed at enhancing the accuracy of relevance scoring for long documents. Our methodology firstly segments a long document into blocks, each of which is embedded using an LLM. We aggregate the query-block relevance scores through a weighted sum method, yielding a comprehensive score for the query with the entire document.
arXiv Detail & Related papers (2025-01-28T16:03:52Z)
DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task emphDocument-Aware Passage Retrieval (DAPR) While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context. Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
Natural Logic-guided Autoregressive Multi-hop Document Retrieval for Fact Verification [21.04611844009438]
We propose a novel retrieve-and-rerank method for multi-hop retrieval. It consists of a retriever that jointly scores documents in the knowledge source and sentences from previously retrieved documents. It is guided by a proof system that dynamically terminates the retrieval process if the evidence is deemed sufficient.
arXiv Detail & Related papers (2022-12-10T11:32:38Z)
SeDR: Segment Representation Learning for Long Documents Dense Retrieval [17.864362372788374]
We propose Segment representation learning for long documents Dense Retrieval (SeDR) SeDR encodes long documents into document-aware and segment-sensitive representations, while it holds the complexity of splitting-and-pooling. Experiments on MS MARCO and TREC-DL datasets show that SeDR achieves superior performance among DR models.
arXiv Detail & Related papers (2022-11-20T01:28:44Z)
Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions. Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion. The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation [49.940525611640346]
Document Augmentation for dense Retrieval (DAR) framework augments the representations of documents with their Dense Augmentation and perturbations. We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z)
Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects. Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency. We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
On Generating Extended Summaries of Long Documents [16.149617108647707]
We present a new method for generating extended summaries of long papers. Our method exploits hierarchical structure of the documents and incorporates it into an extractive summarization model. Our analysis shows that our multi-tasking approach can adjust extraction probability distribution to the favor of summary-worthy sentences.
arXiv Detail & Related papers (2020-12-28T08:10:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.