Adapting Learned Sparse Retrieval for Long Documents
- URL: http://arxiv.org/abs/2305.18494v1
- Date: Mon, 29 May 2023 13:50:16 GMT
- Title: Adapting Learned Sparse Retrieval for Long Documents
- Authors: Thong Nguyen, Sean MacAvaney and Andrew Yates
- Abstract summary: Learned sparse retrieval (LSR) is a family of neural retrieval methods that transform queries and documents into sparse weight vectors aligned with a vocabulary.
While LSR approaches like Splade work well for short passages, it is unclear how well they handle longer documents.
We investigate existing aggregation approaches for adapting LSR to longer documents and find that proximal scoring is crucial for LSR to handle long documents.
- Score: 23.844134960568976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learned sparse retrieval (LSR) is a family of neural retrieval methods that
transform queries and documents into sparse weight vectors aligned with a
vocabulary. While LSR approaches like Splade work well for short passages, it
is unclear how well they handle longer documents. We investigate existing
aggregation approaches for adapting LSR to longer documents and find that
proximal scoring is crucial for LSR to handle long documents. To leverage this
property, we propose two adaptations of the Sequential Dependence Model (SDM)
to LSR: ExactSDM and SoftSDM. ExactSDM assumes only exact query term
dependence, while SoftSDM uses potential functions that model the dependence of
query terms and their expansion terms (i.e., terms identified using a
transformer's masked language modeling head).
Experiments on the MSMARCO Document and TREC Robust04 datasets demonstrate
that both ExactSDM and SoftSDM outperform existing LSR aggregation approaches
for different document length constraints. Surprisingly, SoftSDM does not
provide any performance benefits over ExactSDM. This suggests that soft
proximity matching is not necessary for modeling term dependence in LSR.
Overall, this study provides insights into handling long documents with LSR,
proposing adaptations that improve its performance.
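To make the two ingredients above concrete, here is a minimal, illustrative Python sketch (not the authors' implementation): an LSR relevance score is the dot product of vocabulary-aligned sparse weight vectors, and an ExactSDM-style score additionally rewards proximity of exact query terms with ordered and unordered window counts, in the spirit of the classic Sequential Dependence Model. The function names, lambda weights, and window size below are hypothetical illustration choices, not values from the paper.

```python
# Minimal sketch, assuming: LSR weights are non-negative floats over a shared
# vocabulary, and an SDM-style score linearly combines a unigram feature
# (sparse dot product) with ordered and unordered window counts over adjacent
# query-term pairs. Lambda weights and window size are hypothetical.

def lsr_dot_product(query_weights: dict, doc_weights: dict) -> float:
    """Unigram LSR relevance: dot product of sparse, vocabulary-aligned vectors."""
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())

def ordered_pair_count(doc_tokens: list, a: str, b: str) -> int:
    """How often term b immediately follows term a in the document."""
    return sum(1 for x, y in zip(doc_tokens, doc_tokens[1:]) if x == a and y == b)

def unordered_window_count(doc_tokens: list, a: str, b: str, window: int = 8) -> int:
    """How often terms a and b co-occur within a sliding window of `window` tokens."""
    count = 0
    for i in range(len(doc_tokens)):
        span = doc_tokens[i:i + window]
        if a in span and b in span:
            count += 1
    return count

def exact_sdm_style_score(query_terms, query_weights, doc_weights, doc_tokens,
                          lam_uni=0.8, lam_ord=0.1, lam_unord=0.1):
    """Combine the unigram LSR score with proximity features computed over
    adjacent query-term pairs, using exact matches only."""
    uni = lsr_dot_product(query_weights, doc_weights)
    pairs = list(zip(query_terms, query_terms[1:]))
    ordered = sum(ordered_pair_count(doc_tokens, a, b) for a, b in pairs)
    unordered = sum(unordered_window_count(doc_tokens, a, b) for a, b in pairs)
    return lam_uni * uni + lam_ord * ordered + lam_unord * unordered

# Toy usage: expansion terms (e.g. "ranking") may carry weight in the sparse
# vectors even when absent from the raw text.
q_terms = ["sparse", "retrieval"]
q_weights = {"sparse": 1.2, "retrieval": 1.5, "ranking": 0.3}
d_weights = {"sparse": 0.9, "retrieval": 1.1, "index": 0.4}
d_tokens = "learned sparse retrieval scores long documents".split()
print(exact_sdm_style_score(q_terms, q_weights, d_weights, d_tokens))
```

A SoftSDM-style variant would, per the abstract, let the proximity features also fire on a query term's expansion terms rather than only on exact query terms.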
Related papers
- Zero-Shot Dense Retrieval with Embeddings from Relevance Feedback [17.986392250269606]
We introduce Real Document Embeddings from Relevance Feedback (ReDE-RF)
Inspired by relevance feedback, ReDE-RF proposes to re-frame hypothetical document generation as a relevance estimation task.
Our experiments show that ReDE-RF consistently surpasses state-of-the-art zero-shot dense retrieval methods.
arXiv Detail & Related papers (2024-10-28T17:40:40Z) - Towards Scalable Semantic Representation for Recommendation [65.06144407288127]
Mixture-of-Codes is proposed to construct semantic IDs based on large language models (LLMs)
Our method achieves superior discriminability and dimension robustness scalability, leading to the best scale-up performance in recommendations.
arXiv Detail & Related papers (2024-10-12T15:10:56Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - Beyond Inter-Item Relations: Dynamic Adaption for Enhancing LLM-Based Sequential Recommendation [83.87767101732351]
Sequential recommender systems (SRS) predict the next items that users may prefer based on user historical interaction sequences.
Inspired by the rise of large language models (LLMs) in various AI applications, there is a surge of work on LLM-based SRS.
We propose DARec, a sequential recommendation model built on top of coarse-grained adaptation for capturing inter-item relations.
arXiv Detail & Related papers (2024-08-14T10:03:40Z) - DR-RAG: Applying Dynamic Document Relevance to Retrieval-Augmented Generation for Question-Answering [4.364937306005719]
Retrieval-Augmented Generation (RAG) has recently been shown to improve the performance of Large Language Models (LLMs) on knowledge-intensive tasks such as Question-Answering (QA).
We find that even when some critical documents have low relevance to the query, they can still be retrieved by combining parts of already-retrieved documents with the query.
A two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to improve document retrieval recall and the accuracy of answers.
arXiv Detail & Related papers (2024-06-11T15:15:33Z) - Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection [28.15184715270483]
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility.
We propose a novel paradigm named Sparse RAG, which seeks to cut costs through sparsity.
Sparse RAG encodes retrieved documents in parallel, which eliminates the latency introduced by long-range attention over retrieved documents.
arXiv Detail & Related papers (2024-05-25T11:10:04Z) - Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z) - ODSum: New Benchmarks for Open Domain Multi-Document Summarization [30.875191848268347]
Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries.
We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets.
arXiv Detail & Related papers (2023-09-16T11:27:34Z) - Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows retrieval models (RMs) to expand knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z) - Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs); a minimal illustrative sketch of this expansion step appears at the end of this page.
query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets.
Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
arXiv Detail & Related papers (2023-03-14T07:27:30Z) - SeDR: Segment Representation Learning for Long Documents Dense Retrieval [17.864362372788374]
We propose Segment representation learning for long documents Dense Retrieval (SeDR)
SeDR encodes long documents into document-aware and segment-sensitive representations while keeping the complexity of splitting-and-pooling.
Experiments on MS MARCO and TREC-DL datasets show that SeDR achieves superior performance among DR models.
arXiv Detail & Related papers (2022-11-20T01:28:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
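As a concrete illustration of the Query2doc entry above, the following is a minimal, hedged Python sketch of query expansion with LLM-generated pseudo-documents: an LLM is few-shot prompted to write a short passage for the query, and the query is then expanded with that passage before lexical retrieval. The stub function, the prompt wording, and the query-repetition factor are hypothetical placeholders; the paper's exact prompt and weighting scheme may differ.

```python
# Illustrative sketch in the spirit of query2doc: expand a query with an
# LLM-generated pseudo-document before running a lexical retriever (e.g. BM25).
# `generate_pseudo_document` is a hypothetical stub, not a real API.

def generate_pseudo_document(query: str) -> str:
    """Stand-in for a few-shot prompted LLM that writes a short passage
    answering the query. Replace with an actual LLM call."""
    return f"A short passage that plausibly answers the question: {query}"

def expand_query(query: str, repeat_query: int = 5) -> str:
    """Append the pseudo-document to the query. Repeating the original query
    keeps its terms dominant under term-frequency scoring such as BM25
    (the exact weighting used by query2doc may differ)."""
    pseudo_doc = generate_pseudo_document(query)
    return " ".join([query] * repeat_query + [pseudo_doc])

expanded = expand_query("how does learned sparse retrieval handle long documents")
# `expanded` would then be issued to a standard lexical retriever.
```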