Fine-Grained Distillation for Long Document Retrieval
- URL: http://arxiv.org/abs/2212.10423v1
- Date: Tue, 20 Dec 2022 17:00:36 GMT
- Title: Fine-Grained Distillation for Long Document Retrieval
- Authors: Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Guodong Long, Can
Xu, Daxin Jiang
- Abstract summary: Long document retrieval aims to fetch query-relevant documents from a large-scale collection.
Knowledge distillation has become de facto to improve a retriever by mimicking a heterogeneous yet powerful cross-encoder.
We propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers.
- Score: 86.39802110609062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long document retrieval aims to fetch query-relevant documents from a
large-scale collection, where knowledge distillation has become de facto to
improve a retriever by mimicking a heterogeneous yet powerful cross-encoder.
However, in contrast to passages or sentences, retrieval on long documents
suffers from the scope hypothesis that a long document may cover multiple
topics. This maximizes their structure heterogeneity and poses a
granular-mismatch issue, leading to an inferior distillation efficacy. In this
work, we propose a new learning framework, fine-grained distillation (FGD), for
long-document retrievers. While preserving the conventional dense retrieval
paradigm, it first produces global-consistent representations crossing
different fine granularity and then applies multi-granular aligned distillation
merely during training. In experiments, we evaluate our framework on two
long-document retrieval benchmarks, which show state-of-the-art performance.
Related papers
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task emphDocument-Aware Passage Retrieval (DAPR)
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z) - Natural Logic-guided Autoregressive Multi-hop Document Retrieval for
Fact Verification [21.04611844009438]
We propose a novel retrieve-and-rerank method for multi-hop retrieval.
It consists of a retriever that jointly scores documents in the knowledge source and sentences from previously retrieved documents.
It is guided by a proof system that dynamically terminates the retrieval process if the evidence is deemed sufficient.
arXiv Detail & Related papers (2022-12-10T11:32:38Z) - SeDR: Segment Representation Learning for Long Documents Dense Retrieval [17.864362372788374]
We propose Segment representation learning for long documents Dense Retrieval (SeDR)
SeDR encodes long documents into document-aware and segment-sensitive representations, while it holds the complexity of splitting-and-pooling.
Experiments on MS MARCO and TREC-DL datasets show that SeDR achieves superior performance among DR models.
arXiv Detail & Related papers (2022-11-20T01:28:44Z) - Learning Diverse Document Representations with Deep Query Interactions
for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - Augmenting Document Representations for Dense Retrieval with
Interpolation and Perturbation [49.940525611640346]
Document Augmentation for dense Retrieval (DAR) framework augments the representations of documents with their Dense Augmentation and perturbations.
We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z) - Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z) - On Generating Extended Summaries of Long Documents [16.149617108647707]
We present a new method for generating extended summaries of long papers.
Our method exploits hierarchical structure of the documents and incorporates it into an extractive summarization model.
Our analysis shows that our multi-tasking approach can adjust extraction probability distribution to the favor of summary-worthy sentences.
arXiv Detail & Related papers (2020-12-28T08:10:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.