SeDR: Segment Representation Learning for Long Documents Dense Retrieval
- URL: http://arxiv.org/abs/2211.10841v1
- Date: Sun, 20 Nov 2022 01:28:44 GMT
- Title: SeDR: Segment Representation Learning for Long Documents Dense Retrieval
- Authors: Junying Chen, Qingcai Chen, Dongfang Li, Yutao Huang
- Abstract summary: We propose Segment representation learning for long documents Dense Retrieval (SeDR).
SeDR encodes long documents into document-aware and segment-sensitive representations while retaining the complexity of splitting-and-pooling.
Experiments on MS MARCO and TREC-DL datasets show that SeDR achieves superior performance among DR models.
- Score: 17.864362372788374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Dense Retrieval (DR) has become a promising solution to document retrieval, where document representations are used to perform effective and efficient semantic search. However, DR remains challenging on long documents, due to the quadratic complexity of its Transformer-based encoder and the finite capacity of a low-dimensional embedding. Current DR models apply suboptimal strategies, such as truncation or splitting-and-pooling, to long documents, leading to poor utilization of whole-document information. In this work, to tackle this problem, we propose Segment representation learning for long documents Dense Retrieval (SeDR). In SeDR, a Segment-Interaction Transformer is proposed to encode long documents into document-aware and segment-sensitive representations; it retains the complexity of splitting-and-pooling while outperforming other segment-interaction patterns on DR. Since the GPU memory required for long-document encoding leaves insufficient negatives for DR training, Late-Cache Negative is further proposed to provide additional cache negatives for optimizing representation learning. Experiments on MS MARCO and TREC-DL datasets show that SeDR achieves superior performance among DR models and confirm the effectiveness of SeDR on long document retrieval.
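The abstract names two mechanisms: a Segment-Interaction Transformer that gives segment representations a document-wide view, and Late-Cache Negative, which reuses cached document embeddings as extra training negatives. A minimal PyTorch sketch of both ideas follows; the segment length, the specific interaction pattern, and the cache policy are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only, based on the abstract's description; SeDR's exact
# interaction pattern, segment length, and cache policy may differ.
import torch
import torch.nn.functional as F

def split_into_segments(token_ids, seg_len=128, pad_id=0):
    """Split a long document into fixed-length segments, padding the tail."""
    pad = (-len(token_ids)) % seg_len
    token_ids = token_ids + [pad_id] * pad
    return [token_ids[i:i + seg_len] for i in range(0, len(token_ids), seg_len)]

def segment_interaction_mask(n_segs, seg_len):
    """Boolean attention mask: tokens attend within their own segment, and all
    tokens can additionally attend to each segment's leading position, so each
    segment representation becomes document-aware (one plausible pattern,
    assumed here)."""
    L = n_segs * seg_len
    mask = torch.zeros(L, L, dtype=torch.bool)
    for s in range(n_segs):
        a, b = s * seg_len, (s + 1) * seg_len
        mask[a:b, a:b] = True   # full attention inside the segment
        mask[:, a] = True       # every token sees each segment's head token
    return mask

class LateCacheNegatives:
    """FIFO cache of document embeddings from earlier training steps, reused as
    extra negatives when GPU memory limits the in-batch negative pool.
    Assumes all tensors live on the same device."""
    def __init__(self, size=4096, dim=768):
        self.size = size
        self.cache = torch.empty(0, dim)

    def update(self, doc_embs):
        self.cache = torch.cat([doc_embs.detach(), self.cache])[: self.size]

    def loss(self, q, pos, batch_negs):
        """InfoNCE-style loss: the positive sits at index 0; cached embeddings
        are appended to the in-batch negatives."""
        negs = torch.cat([batch_negs, self.cache]) if len(self.cache) else batch_negs
        logits = torch.cat([(q * pos).sum(-1, keepdim=True), q @ negs.T], dim=-1)
        return F.cross_entropy(logits, torch.zeros(len(q), dtype=torch.long))
```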
Related papers
- Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection [28.15184715270483]
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility.
We propose a novel paradigm named Sparse RAG, which seeks to cut costs through sparsity.
Sparse RAG encodes retrieved documents in parallel, eliminating the latency introduced by long-range attention over retrieved documents.
arXiv Detail & Related papers (2024-05-25T11:10:04Z)
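A rough illustration of the parallel-encoding and sparsity ideas in the Sparse RAG summary above (the interface and the selection criterion are assumptions, not the paper's actual design): each retrieved document is encoded independently, and only the top-k most query-relevant documents are kept.

```python
# Illustrative only: per-document encoding plus a simple top-k selection;
# Sparse RAG's actual encoder and selection rule may differ.
from typing import Callable, List
import torch

def encode_in_parallel(encode: Callable[[str], torch.Tensor],
                       docs: List[str]) -> List[torch.Tensor]:
    # Attention runs within one document at a time, so cost grows linearly
    # with the number of documents instead of quadratically with total length.
    return [encode(d) for d in docs]

def select_sparse_context(doc_embs: List[torch.Tensor],
                          query_emb: torch.Tensor, k: int = 2) -> List[int]:
    """Keep the k documents most similar to the query (assumed criterion)."""
    sims = torch.stack([(e * query_emb).sum() for e in doc_embs])
    return sims.topk(min(k, len(doc_embs))).indices.tolist()
```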
- ODSum: New Benchmarks for Open Domain Multi-Document Summarization [30.875191848268347]
Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries.
We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets.
arXiv Detail & Related papers (2023-09-16T11:27:34Z)
- Adapting Learned Sparse Retrieval for Long Documents [23.844134960568976]
Learned sparse retrieval (LSR) is a family of neural retrieval methods that transform queries and documents into sparse weight vectors aligned with a vocabulary.
While LSR approaches like Splade work well for short passages, it is unclear how well they handle longer documents.
We investigate existing aggregation approaches for adapting LSR to longer documents and find that proximal scoring is crucial for handling them.
arXiv Detail & Related papers (2023-05-29T13:50:16Z)
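One standard way to aggregate learned sparse representations over segments, shown as a hedged sketch (the paper above evaluates several aggregation approaches; this max-pooling variant is just one common choice, not necessarily its recommendation):

```python
# Sketch: encode each segment into a |vocab|-sized term-weight vector, then
# max-pool weights across segments to get a single document vector.
import torch

def aggregate_segments(segment_vecs: torch.Tensor) -> torch.Tensor:
    """segment_vecs: (n_segments, vocab_size) non-negative term weights."""
    return segment_vecs.max(dim=0).values

def lsr_score(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> torch.Tensor:
    """Sparse dot product between query and document term weights."""
    return (query_vec * doc_vec).sum()
```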
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
In analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the majority of errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- Fine-Grained Distillation for Long Document Retrieval [86.39802110609062]
Long document retrieval aims to fetch query-relevant documents from a large-scale collection.
Knowledge distillation has become the de facto approach to improving a retriever by mimicking a heterogeneous yet powerful cross-encoder.
We propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers.
arXiv Detail & Related papers (2022-12-20T17:00:36Z)
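The generic cross-encoder-to-retriever distillation setup the FGD summary refers to can be sketched as follows; FGD's fine-grained variant adds more structure than this plain KL objective, which is assumed here purely for illustration.

```python
# Plain score-distribution distillation (not FGD's fine-grained objective):
# the retriever (student) matches the cross-encoder's (teacher) distribution
# over a list of candidate documents for the same query.
import torch
import torch.nn.functional as F

def distill_loss(retriever_scores: torch.Tensor,
                 cross_encoder_scores: torch.Tensor,
                 tau: float = 1.0) -> torch.Tensor:
    """KL divergence between temperature-softened score distributions."""
    student = F.log_softmax(retriever_scores / tau, dim=-1)
    teacher = F.softmax(cross_encoder_scores / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")
```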
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
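A hedged sketch of the multi-view scoring the entry above implies (how the views are built and combined is not specified in the summary, so both are assumed): each document carries one embedding per generated pseudo-query, and retrieval takes the best-matching view.

```python
# Illustration only: score a query against multi-view document representations
# by taking the maximum per-view dot product.
import torch

def multi_view_score(query_emb: torch.Tensor,
                     doc_views: torch.Tensor) -> torch.Tensor:
    """query_emb: (dim,); doc_views: (n_views, dim), one view per pseudo-query."""
    return (doc_views @ query_emb).max()
```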
- Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation [49.940525611640346]
The Document Augmentation for dense Retrieval (DAR) framework augments document representations with their interpolations and perturbations.
We validate DAR on retrieval tasks with two benchmark datasets, showing that it significantly outperforms relevant baselines on dense retrieval of both labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z)
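Embedding-level interpolation and perturbation, as the DAR title suggests, can be sketched as below; the mixing coefficient and noise scale are illustrative assumptions, not DAR's actual hyperparameters.

```python
# Sketch of interpolation- and perturbation-based document augmentation.
import torch

def interpolate(doc_a: torch.Tensor, doc_b: torch.Tensor,
                alpha: float = 0.5) -> torch.Tensor:
    """Mixup-style interpolation between two document embeddings."""
    return alpha * doc_a + (1.0 - alpha) * doc_b

def perturb(doc: torch.Tensor, sigma: float = 0.01) -> torch.Tensor:
    """Gaussian perturbation around a document embedding."""
    return doc + sigma * torch.randn_like(doc)
```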
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document, where the top level captures long-range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- SDR: Efficient Neural Re-ranking using Succinct Document Representation [4.9278175139681215]
We propose the Succinct Document Representation scheme that computes highly compressed intermediate document representations.
Our method is highly efficient, achieving 4x-11.6x better compression rates for the same ranking quality.
arXiv Detail & Related papers (2021-10-03T07:43:16Z)
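The SDR summary does not say how the compression works; as a loose stand-in, a down-projection plus int8 quantization conveys the kind of compressed intermediate representation being described (every detail below is an assumption, not SDR's scheme).

```python
# Loose stand-in for a succinct document representation: project the embedding
# to a lower dimension and quantize to int8 for compact storage.
import torch

def compress(doc_emb: torch.Tensor, proj: torch.Tensor):
    """doc_emb: (dim,); proj: (dim, d_small). Returns int8 codes and a scale."""
    low = doc_emb @ proj
    scale = low.abs().max().clamp(min=1e-8) / 127.0
    codes = (low / scale).round().clamp(-127, 127).to(torch.int8)
    return codes, scale

def decompress(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Approximate reconstruction of the low-dimensional representation."""
    return codes.to(torch.float32) * scale
```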
- Denoising Relation Extraction from Document-level Distant Supervision [92.76441007250197]
We propose a novel pre-trained model for DocRE, which denoises the document-level DS data via multiple pre-training tasks.
Experimental results on the large-scale DocRE benchmark show that our model can capture useful information from noisy DS data and achieve promising results.
arXiv Detail & Related papers (2020-11-08T02:05:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.