Adapting Learned Sparse Retrieval for Long Documents
- URL: http://arxiv.org/abs/2305.18494v1
- Date: Mon, 29 May 2023 13:50:16 GMT
- Title: Adapting Learned Sparse Retrieval for Long Documents
- Authors: Thong Nguyen, Sean MacAvaney and Andrew Yates
- Abstract summary: Learned sparse retrieval (LSR) is a family of neural retrieval methods that transform queries and documents into sparse weight vectors aligned with a vocabulary.
While LSR approaches like Splade work well for short passages, it is unclear how well they handle longer documents.
We investigate existing aggregation approaches for adapting LSR to longer documents and find that proximal scoring is crucial for LSR to handle long documents.
- Score: 23.844134960568976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learned sparse retrieval (LSR) is a family of neural retrieval methods that
transform queries and documents into sparse weight vectors aligned with a
vocabulary. While LSR approaches like Splade work well for short passages, it
is unclear how well they handle longer documents. We investigate existing
aggregation approaches for adapting LSR to longer documents and find that
proximal scoring is crucial for LSR to handle long documents. To leverage this
property, we propose two adaptations of the Sequential Dependence Model (SDM)
to LSR: ExactSDM and SoftSDM. ExactSDM assumes only exact query term
dependence, while SoftSDM uses potential functions that model the dependence of
query terms and their expansion terms (i.e., terms identified using a
transformer's masked language modeling head).
Experiments on the MSMARCO Document and TREC Robust04 datasets demonstrate
that both ExactSDM and SoftSDM outperform existing LSR aggregation approaches
for different document length constraints. Surprisingly, SoftSDM does not
provide any performance benefits over ExactSDM. This suggests that soft
proximity matching is not necessary for modeling term dependence in LSR.
Overall, this study provides insights into handling long documents with LSR,
proposing adaptations that improve its performance.
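To make the mechanism concrete, here is a minimal sketch of SDM-style scoring over LSR term weights, in the spirit of ExactSDM. The window size, interpolation weights, count-based potentials, and toy data are illustrative assumptions, not the paper's exact potential functions.

```python
# A minimal sketch of SDM-style scoring on top of LSR term weights, in the
# spirit of ExactSDM as described above. Weights, window size, and the
# simple count-based potentials are illustrative assumptions.

def window_count(doc: list[str], t1: str, t2: str, window: int = 8,
                 ordered: bool = True) -> int:
    """Count co-occurrences of t1 and t2 within `window` tokens."""
    pos1 = [i for i, tok in enumerate(doc) if tok == t1]
    pos2 = [j for j, tok in enumerate(doc) if tok == t2]
    if ordered:
        return sum(1 for i in pos1 for j in pos2 if 0 < j - i <= window)
    return sum(1 for i in pos1 for j in pos2 if 0 < abs(j - i) <= window)

def exact_sdm_score(query_terms: list[str], doc: list[str],
                    lsr_weights: dict[str, float],
                    w_uni: float = 0.8, w_ord: float = 0.1,
                    w_unord: float = 0.1) -> float:
    # Unigram part: LSR weights for exact query-term matches.
    unigram = sum(lsr_weights.get(t, 0.0) for t in query_terms if t in doc)
    # Proximity part: ordered/unordered windows over adjacent query-term pairs.
    pairs = list(zip(query_terms, query_terms[1:]))
    ordered = sum(window_count(doc, a, b, ordered=True) for a, b in pairs)
    unordered = sum(window_count(doc, a, b, ordered=False) for a, b in pairs)
    return w_uni * unigram + w_ord * ordered + w_unord * unordered

doc = "sparse retrieval models score long documents with sparse vectors".split()
print(exact_sdm_score(["sparse", "retrieval"], doc,
                      {"sparse": 1.3, "retrieval": 0.9}))  # 2.06
```

The proximity terms reward documents where query terms co-occur within a small window, which is the proximal-scoring property the abstract identifies as crucial for long documents.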
Related papers
- Attribute or Abstain: Large Language Models as Long Document Assistants [58.32043134560244]
We present a benchmark of 6 diverse long document tasks with attribution, and experiment with different approaches to attribution on 4 LLMs of different sizes.
We find that citation, i.e., response generation and evidence extraction in one step, mostly performs best.
We also find that evidence quality can predict response quality on datasets with simple responses, but not so for complex responses.
arXiv Detail & Related papers (2024-07-10T16:16:02Z)
- DR-RAG: Applying Dynamic Document Relevance to Retrieval-Augmented Generation for Question-Answering [4.364937306005719]
RAG has recently improved the performance of Large Language Models (LLMs) on knowledge-intensive tasks such as Question-Answering (QA)
We find that even when some critical documents have low relevance to the query, they can still be retrieved by combining parts of already-retrieved documents with the query.
A two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to improve document retrieval recall and the accuracy of answers.
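A hypothetical sketch of such a two-stage loop is below; the `retrieve` callable, the snippet length, and the merge strategy are assumptions, and the paper's actual framework (including how dynamic relevance is judged) may differ.

```python
# Hypothetical two-stage retrieval loop in the spirit of the DR-RAG summary.

def two_stage_retrieve(query: str, retrieve, k1: int = 5, k2: int = 3) -> list[str]:
    # Stage 1: retrieve with the raw query.
    first_pass = retrieve(query, k1)
    # Stage 2: re-query with the query plus a snippet of each retrieved
    # document, so documents with low direct query relevance can still be found.
    second_pass = []
    for doc in first_pass:
        second_pass.extend(retrieve(query + " " + doc[:200], k2))
    # Merge, deduplicating while preserving rank order.
    seen, merged = set(), []
    for doc in first_pass + second_pass:
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged
```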
arXiv Detail & Related papers (2024-06-11T15:15:33Z)
- Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection [28.15184715270483]
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility.
We propose a novel paradigm named Sparse RAG, which seeks to cut costs through sparsity.
Sparse RAG encodes retrieved documents in parallel, which eliminates the latency introduced by long-range attention over retrieved documents.
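A toy sketch of that parallel-encoding idea follows; `encode` and `is_relevant` are stand-ins, not the paper's actual components.

```python
# Each retrieved document is encoded independently (no cross-document
# attention), and only the caches of documents judged relevant are kept
# for decoding.

def sparse_rag_context(query: str, docs: list[str], encode, is_relevant):
    # Independent encoding: cost grows linearly with the number of documents,
    # instead of quadratically with the concatenated context length.
    caches = [encode(doc) for doc in docs]
    # Sparse selection: the decoder later attends only to the kept caches.
    return [cache for cache, doc in zip(caches, docs) if is_relevant(query, doc)]
```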
arXiv Detail & Related papers (2024-05-25T11:10:04Z)
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
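A hedged sketch of one way such a dense-to-sparse transformation can work is shown below: project a frozen dense embedding onto vocabulary logits, keep non-negative weights, and sparsify by top-k. The random projection stands in for learned weights, and the paper's probabilistic expansion control is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
DENSE_DIM, VOCAB_SIZE, TOP_K = 64, 1000, 32
projection = rng.normal(size=(DENSE_DIM, VOCAB_SIZE))  # stand-in, learned in practice

def dense_to_sparse(dense_vec: np.ndarray) -> dict[int, float]:
    logits = np.maximum(dense_vec @ projection, 0.0)  # non-negative term weights
    top = np.argsort(logits)[-TOP_K:]                 # indices of the k largest weights
    return {int(i): float(logits[i]) for i in top if logits[i] > 0}

sparse_vec = dense_to_sparse(rng.normal(size=DENSE_DIM))  # vocab-id -> weight
```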
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
- ODSum: New Benchmarks for Open Domain Multi-Document Summarization [30.875191848268347]
Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries.
We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets.
arXiv Detail & Related papers (2023-09-16T11:27:34Z)
- Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows retrieval models (RMs) to expand the knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z)
- Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs)
query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets.
Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
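A sketch of the recipe as summarized above follows; `llm` is a stand-in for any text-generation call, and the prompt, repetition factor, and example are illustrative.

```python
# Few-shot prompt an LLM for a pseudo-document, then concatenate it with the
# (repeated) query before BM25 retrieval.

FEW_SHOT_PROMPT = (
    "Write a passage that answers the given query.\n"
    "Query: what causes ocean tides\n"
    "Passage: Ocean tides are caused mainly by the gravitational pull of the moon...\n\n"
)

def expand_query(query: str, llm, repeats: int = 5) -> str:
    pseudo_doc = llm(FEW_SHOT_PROMPT + f"Query: {query}\nPassage:")
    # Repeat the original query so its terms are not swamped by the much
    # longer pseudo-document in term-frequency-based scoring like BM25.
    return " ".join([query] * repeats) + " " + pseudo_doc
```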
arXiv Detail & Related papers (2023-03-14T07:27:30Z)
- DSI++: Updating Transformer Memory with New Documents [95.70264288158766]
We introduce DSI++, a continual learning challenge for DSI to incrementally index new documents.
We show that continual indexing of new documents leads to considerable forgetting of previously indexed documents.
We introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task.
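A simplified sketch of that generative-memory idea is below; `pseudo_query` stands in for the paper's learned query generator, and the replay ratio is an assumption.

```python
# While indexing new documents, also train on pseudo-queries sampled for
# previously indexed documents so the model keeps retrieving them.

import random

def continual_index_examples(new_docs: dict[str, str], old_docs: dict[str, str],
                             pseudo_query, replay_ratio: float = 0.5):
    """Build (input_text, target_doc_id) pairs for one continual-indexing step."""
    examples = [(text, doc_id) for doc_id, text in new_docs.items()]
    n_replay = min(int(len(examples) * replay_ratio), len(old_docs))
    for doc_id, text in random.sample(sorted(old_docs.items()), n_replay):
        examples.append((pseudo_query(text), doc_id))  # rehearsal against forgetting
    random.shuffle(examples)
    return examples
```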
arXiv Detail & Related papers (2022-12-19T18:59:34Z)
- SeDR: Segment Representation Learning for Long Documents Dense Retrieval [17.864362372788374]
We propose Segment representation learning for long documents Dense Retrieval (SeDR)
SeDR encodes long documents into document-aware and segment-sensitive representations, while retaining the low complexity of splitting-and-pooling.
Experiments on MS MARCO and TREC-DL datasets show that SeDR achieves superior performance among DR models.
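An illustrative sketch of basic segment-level scoring for long documents is below: split into fixed-size segments, encode each, and score the query against the best-matching segment. SeDR's document-aware segment interaction is richer than this; `encode` is a stand-in encoder returning a vector.

```python
def split_segments(tokens: list[str], size: int = 128) -> list[list[str]]:
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def best_segment_score(query_vec: list[float], doc_tokens: list[str], encode) -> float:
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    # Max-pool over segment scores: the best segment represents the document.
    return max(dot(query_vec, encode(seg)) for seg in split_segments(doc_tokens))
```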
arXiv Detail & Related papers (2022-11-20T01:28:44Z)
- MLR-SNet: Transferable LR Schedules for Heterogeneous Tasks [56.66010634895913]
The learning rate (LR) is one of the most important hyperparameters in stochastic gradient descent (SGD) training of deep neural networks (DNNs)
In this paper, we propose MLR-SNet, which learns a proper LR schedule that transfers across heterogeneous tasks.
We also transfer MLR-SNet to query tasks whose noise levels, architectures, data modalities, and sizes differ from those of the training tasks, achieving comparable or even better performance.
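A toy sketch of the interface of a learned LR schedule is below: a small parametric function maps observable training state (here, the current loss and step) to a learning rate. The real MLR-SNet is a meta-learned network trained across tasks; the functional form and constants here are purely illustrative.

```python
import math

def learned_lr(loss: float, step: int,
               params: tuple[float, float] = (0.1, 0.001)) -> float:
    base, decay = params  # meta-learned in the real method, fixed here
    return base * math.exp(-decay * step) * min(loss, 1.0)

# Pseudo-usage inside a training loop:
#   lr = learned_lr(current_loss, step); w -= lr * grad
```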
arXiv Detail & Related papers (2020-07-29T01:18:58Z)