A Neural Model for Joint Document and Snippet Ranking in Question
Answering for Large Document Collections
- URL: http://arxiv.org/abs/2106.08908v1
- Date: Wed, 16 Jun 2021 16:04:19 GMT
- Title: A Neural Model for Joint Document and Snippet Ranking in Question
Answering for Large Document Collections
- Authors: Dimitris Pappas and Ion Androutsopoulos
- Abstract summary: We present an architecture for joint document and snippet ranking.
The architecture is general and can be used with any neural text relevance ranker.
Experiments on biomedical data from BIOASQ show that our joint models vastly outperform the pipelines in snippet retrieval.
- Score: 9.503056487990959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Question answering (QA) systems for large document collections typically use
pipelines that (i) retrieve possibly relevant documents, (ii) re-rank them,
(iii) rank paragraphs or other snippets of the top-ranked documents, and (iv)
select spans of the top-ranked snippets as exact answers. Pipelines are
conceptually simple, but errors propagate from one component to the next,
without later components being able to revise earlier decisions. We present an
architecture for joint document and snippet ranking, the two middle stages,
which leverages the intuition that relevant documents have good snippets and
good snippets come from relevant documents. The architecture is general and can
be used with any neural text relevance ranker. We experiment with two main
instantiations of the architecture, based on POSIT-DRMM (PDRMM) and a
BERT-based ranker. Experiments on biomedical data from BIOASQ show that our
joint models vastly outperform the pipelines in snippet retrieval, the main
goal for QA, with fewer trainable parameters, also remaining competitive in
document retrieval. Furthermore, our joint PDRMM-based model is competitive
with BERT-based models, despite using orders of magnitude fewer parameters.
These claims are also supported by human evaluation on two test batches of
BIOASQ. To test our key findings on another dataset, we modified the Natural
Questions dataset so that it can also be used for document and snippet
retrieval. Our joint PDRMM-based model again outperforms the corresponding
pipeline in snippet retrieval on the modified Natural Questions dataset, even
though it performs worse than the pipeline in document retrieval. We make our
code and the modified Natural Questions dataset publicly available.
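To make the joint-ranking intuition concrete, here is a minimal Python sketch, not the paper's PDRMM or BERT instantiation: a toy token-overlap scorer stands in for the neural relevance ranker, each document is scored by aggregating its snippet scores, and snippet scores are in turn boosted by their document's score, so the two rankings inform each other.

```python
import numpy as np

def snippet_scores(query: str, snippets: list[str]) -> np.ndarray:
    """Stand-in relevance ranker: token overlap per snippet. In the paper
    this role is played by a neural ranker such as PDRMM or BERT."""
    q = set(query.lower().split())
    return np.array([len(q & set(s.lower().split())) / (len(q) or 1)
                     for s in snippets])

def joint_rank(query, docs, top_docs=2, top_snips=3):
    """Joint document and snippet ranking sketch: a document's score is an
    aggregate (max + mean) of its snippet scores, and each snippet's final
    score is modulated by its document's score."""
    all_snips, doc_scores = [], {}
    for doc_id, snips in docs.items():
        s = snippet_scores(query, snips)
        doc_scores[doc_id] = s.max() + s.mean()   # good snippets -> relevant doc
        all_snips += [(doc_id, sn, sc) for sn, sc in zip(snips, s)]
    ranked_docs = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_docs]
    ranked_snips = sorted(
        ((d, sn, sc * (1 + doc_scores[d])) for d, sn, sc in all_snips),
        key=lambda t: t[2], reverse=True)[:top_snips]  # relevant doc -> boosted snippets
    return ranked_docs, ranked_snips

docs = {
    "doc1": ["insulin regulates blood glucose", "the pancreas secretes insulin"],
    "doc2": ["the liver stores glycogen", "glucose is stored as glycogen"],
}
print(joint_rank("how does insulin regulate glucose", docs))
```

In the actual architecture both scoring heads are trained jointly, so the coupling between the two rankings is learned rather than hard-coded as above.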
Related papers
- List-aware Reranking-Truncation Joint Model for Search and
Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
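As a loose illustration of making the two decisions together (GenRT itself learns both with a generative encoder-decoder; the scores and threshold below are placeholder inputs):

```python
def rerank_and_truncate(candidates, scores, threshold=0.5):
    """Toy joint reranking + truncation: sort candidates by relevance,
    then keep only the prefix whose scores clear a utility threshold.
    GenRT learns the ordering and the cut-off position jointly; here
    both come from placeholder scores, purely for illustration."""
    ranked = sorted(zip(candidates, scores), key=lambda t: t[1], reverse=True)
    return [c for c, s in ranked if s >= threshold]

print(rerank_and_truncate(["d1", "d2", "d3", "d4"], [0.9, 0.2, 0.7, 0.4]))
# ['d1', 'd3']: reranked, then truncated after the second result
```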
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- PEneo: Unifying Line Extraction, Line Grouping, and Entity Linking for End-to-end Document Pair Extraction [28.205723817300576]
Document pair extraction aims to identify key and value entities as well as their relationships from visually-rich documents.
Most existing methods divide it into two separate tasks: semantic entity recognition (SER) and relation extraction (RE).
This paper introduces a novel framework, PEneo, which performs document pair extraction in a unified pipeline.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the majority of errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
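One simple remedy for missing document context, sketched here as a plain text-concatenation baseline (an assumption, not necessarily the strongest system in the benchmark), is to prepend document-level text such as the title to every passage before encoding:

```python
def contextualize_passages(doc_title: str, passages: list[str]) -> list[str]:
    """Prepend document-level context to each passage so a standard passage
    retriever can resolve document-dependent references such as 'it' or
    'the company', the dominant error class reported by DAPR."""
    return [f"{doc_title} [SEP] {p}" for p in passages]

print(contextualize_passages(
    "Apple Inc.",
    ["It was founded in 1976.", "Its headquarters are in Cupertino."]))
```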
arXiv Detail & Related papers (2022-12-18T15:57:46Z)
- CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query.
Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
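A hypothetical sketch of such a curriculum, assuming a simple linear schedule (the paper's actual sampler may differ): early training steps mostly draw document-derived pseudo queries, and the probability of drawing the real query grows with training progress.

```python
import random

def sample_query(real_query, pseudo_queries, step, total_steps):
    """Curriculum sampling sketch: the chance of training on the real query
    rises linearly with progress, moving the retriever from easy,
    document-generated pseudo queries toward the real query distribution."""
    if random.random() < step / total_steps:
        return real_query
    return random.choice(pseudo_queries)

pseudo = ["what is insulin", "insulin function pancreas"]
for step in (0, 500, 999):
    print(step, sample_query("how does insulin regulate glucose?", pseudo,
                             step, total_steps=1000))
```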
arXiv Detail & Related papers (2022-10-19T16:19:37Z)
- Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking [56.80065604034095]
We introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant.
To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario.
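A minimal numpy sketch of this kNN-style scoring (random vectors stand in for a real encoder's embeddings): each candidate's score blends its similarity to the query with its mean similarity to the documents the user marked relevant.

```python
import numpy as np

def rerank_with_feedback(query_vec, cand_vecs, relevant_vecs, alpha=0.5):
    """score = alpha * sim(query, cand) + (1 - alpha) * mean sim(relevant, cand).
    Vectors are assumed L2-normalized, so dot product equals cosine similarity.
    A sketch of the kNN relevance-feedback idea, not the paper's exact model."""
    qs = cand_vecs @ query_vec                # (n_candidates,)
    fs = cand_vecs @ relevant_vecs.T          # (n_candidates, n_relevant)
    return np.argsort(-(alpha * qs + (1 - alpha) * fs.mean(axis=1)))

rng = np.random.default_rng(0)
norm = lambda m: m / np.linalg.norm(m, axis=-1, keepdims=True)
q = norm(rng.normal(size=8))                  # query embedding
cands = norm(rng.normal(size=(5, 8)))         # candidate documents
rel = norm(rng.normal(size=(2, 8)))           # user-flagged relevant documents
print(rerank_with_feedback(q, cands, rel))    # candidate indices, best first
```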
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
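The generate-then-read control flow is compact; in the sketch below `llm` is a hypothetical prompt-to-completion callable (a stand-in for whatever large language model is used), and a toy lambda makes the example run end to end.

```python
def generate_then_read(question: str, llm, n_docs: int = 3) -> str:
    """GenRead-style pipeline sketch: (1) prompt the model to *generate*
    context documents instead of retrieving them, then (2) read the
    generated documents to produce the final answer."""
    contexts = [llm(f"Generate a background document to answer: {question}")
                for _ in range(n_docs)]
    return llm("\n\n".join(contexts) + f"\n\nQuestion: {question}\nAnswer:")

fake_llm = lambda prompt: f"[completion for: {prompt[:40]}...]"  # toy stand-in
print(generate_then_read("Who wrote The Origin of Species?", fake_llm))
```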
arXiv Detail & Related papers (2021-08-30T18:10:26Z)
- Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback [29.719150565643965]
This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval.
ANCE-PRF uses a BERT encoder that consumes the query and the top retrieved documents from a dense retrieval model, ANCE, and it learns to produce better query embeddings directly from relevance labels.
Analysis shows that the PRF encoder effectively captures the relevant and complementary information from PRF documents, while ignoring the noise with its learned attention mechanism.
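Schematically the data flow is as below; `toy_encode` is a deliberately crude stand-in for the learned BERT PRF encoder, which in the paper is trained end-to-end from relevance labels.

```python
import numpy as np

def prf_query_embedding(query: str, top_docs: list[str], encode) -> np.ndarray:
    """ANCE-PRF data flow, schematically: re-encode 'query [SEP] d1 [SEP] ...'
    into a single improved query embedding for a second retrieval round."""
    return encode(" [SEP] ".join([query] + top_docs))

def toy_encode(text: str, dim: int = 16) -> np.ndarray:
    """Hashed bag-of-words placeholder, only to make the sketch executable."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

q_emb = prf_query_embedding("insulin and glucose",
                            ["insulin lowers blood glucose",
                             "the pancreas secretes insulin"], toy_encode)
print(q_emb.shape)   # (16,): one dense query vector, enriched by PRF documents
```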
arXiv Detail & Related papers (2020-12-17T02:01:32Z)
- Literature Retrieval for Precision Medicine with Neural Matching and Faceted Summarization [2.978663539080876]
We present a document reranking approach that combines neural query-document matching and text summarization.
Evaluations using NIST's TREC-PM track datasets show that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)
- Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find semantic relations between documents, we apply a series of techniques, such as GloVe, paragraph vectors, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
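Framed this way, inference reduces to encoding the two documents jointly and picking a relation label; everything in this sketch (the label set, encoder, and weights) is an illustrative placeholder rather than the paper's trained model.

```python
import numpy as np

RELATIONS = ["none", "country", "genre", "publisher"]  # illustrative labels only

def toy_encode(text: str, dim: int = 16) -> np.ndarray:
    """Placeholder for a pair encoder such as BERT over 'docA [SEP] docB'."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

def classify_pair(doc_a: str, doc_b: str, W: np.ndarray) -> str:
    """Pairwise multi-class classification: jointly encode the document pair,
    then pick the highest-scoring semantic relation."""
    return RELATIONS[int(np.argmax(W @ toy_encode(f"{doc_a} [SEP] {doc_b}")))]

W = np.random.default_rng(0).normal(size=(len(RELATIONS), 16))  # untrained toy weights
print(classify_pair("Article about a novel.", "Article about its publisher.", W))
```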
arXiv Detail & Related papers (2020-03-22T12:52:56Z)
- Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
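The model in question is the standard two-tower (dual-encoder) retriever; the sketch below shows only the scoring side, where relevance is an inner product over independently computed embeddings, and none of the paper's pre-training tasks.

```python
import numpy as np

def retrieve(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3):
    """Two-tower retrieval sketch: queries and documents are embedded
    separately, so corpus embeddings can be pre-computed and indexed;
    relevance is the inner product between the two towers' outputs."""
    return np.argsort(-(doc_embs @ query_emb))[:k]

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 64))           # pre-computed corpus embeddings
print(retrieve(rng.normal(size=64), doc_embs))   # indices of top-k documents
```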
arXiv Detail & Related papers (2020-02-10T16:44:00Z)