Fine-Grained Distillation for Long Document Retrieval
- URL: http://arxiv.org/abs/2212.10423v1
- Date: Tue, 20 Dec 2022 17:00:36 GMT
- Title: Fine-Grained Distillation for Long Document Retrieval
- Authors: Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Guodong Long, Can Xu, Daxin Jiang
- Abstract summary: Long document retrieval aims to fetch query-relevant documents from a large-scale collection.
Knowledge distillation has become the de facto way to improve a retriever by having it mimic a heterogeneous yet powerful cross-encoder.
We propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers.
- Score: 86.39802110609062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long document retrieval aims to fetch query-relevant documents from a
large-scale collection, where knowledge distillation has become de facto to
improve a retriever by mimicking a heterogeneous yet powerful cross-encoder.
However, in contrast to passages or sentences, retrieval on long documents
suffers from the scope hypothesis that a long document may cover multiple
topics. This maximizes their structure heterogeneity and poses a
granular-mismatch issue, leading to an inferior distillation efficacy. In this
work, we propose a new learning framework, fine-grained distillation (FGD), for
long-document retrievers. While preserving the conventional dense retrieval
paradigm, it first produces global-consistent representations crossing
different fine granularity and then applies multi-granular aligned distillation
merely during training. In experiments, we evaluate our framework on two
long-document retrieval benchmarks, which show state-of-the-art performance.
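The idea of distilling at several granularities can be illustrated with a minimal sketch. The abstract does not state the loss function, so the temperature-scaled KL divergence below, the granularity names ("document", "passage"), and all function names are assumptions chosen for illustration, not the authors' implementation. A teacher (for example, a cross-encoder) scores a list of candidates at each granularity, the student retriever scores the same candidates, and the per-granularity losses are combined in a weighted sum.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw relevance scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over one candidate list."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )


def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """Assumed loss: KL between softmaxed teacher and student scores."""
    return kl_divergence(
        softmax(teacher_scores, temperature),
        softmax(student_scores, temperature),
    )


def multi_granular_loss(teacher_by_level, student_by_level, weights=None):
    """Weighted sum of distillation losses, one per granularity level.

    Both arguments map a (hypothetical) level name to a list of scores
    for the same candidates in the same order.
    """
    if weights is None:
        weights = {level: 1.0 for level in teacher_by_level}
    return sum(
        weights[level]
        * distillation_loss(teacher_by_level[level], student_by_level[level])
        for level in teacher_by_level
    )


if __name__ == "__main__":
    # Made-up scores for three candidates at two granularities.
    teacher = {"document": [4.0, 1.0, 0.5], "passage": [3.0, 2.5, 0.1]}
    student = {"document": [2.0, 1.5, 1.0], "passage": [1.0, 2.0, 0.5]}
    print(multi_granular_loss(teacher, student))
```

Because the loss is only needed to compute gradients for the student, a setup like this leaves inference unchanged: the student still encodes queries and documents independently, which matches the abstract's point that distillation is applied only during training.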
Related papers
- Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search [65.53881294642451]
Deliberate Thinking based Dense Retriever (DEBATER)
DEBATER enhances recent dense retrievers by enabling them to learn more effective document representations through a step-by-step thinking process.
Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
- Enhanced Retrieval of Long Documents: Leveraging Fine-Grained Block Representations with Large Language Models [24.02950598944251]
We introduce a novel, fine-grained approach aimed at enhancing the accuracy of relevance scoring for long documents.
Our methodology firstly segments a long document into blocks, each of which is embedded using an LLM.
We aggregate the query-block relevance scores through a weighted sum method, yielding a comprehensive score for the query with the entire document.
arXiv Detail & Related papers (2025-01-28T16:03:52Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation [49.940525611640346]
The Document Augmentation for dense Retrieval (DAR) framework augments the representations of documents with their interpolation and perturbations.
We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z)
- On Generating Extended Summaries of Long Documents [16.149617108647707]
We present a new method for generating extended summaries of long papers.
Our method exploits hierarchical structure of the documents and incorporates it into an extractive summarization model.
Our analysis shows that our multi-tasking approach can adjust extraction probability distribution to the favor of summary-worthy sentences.
arXiv Detail & Related papers (2020-12-28T08:10:28Z)