CSFCube -- A Test Collection of Computer Science Research Articles for
Faceted Query by Example
- URL: http://arxiv.org/abs/2103.12906v1
- Date: Wed, 24 Mar 2021 01:02:12 GMT
- Title: CSFCube -- A Test Collection of Computer Science Research Articles for
Faceted Query by Example
- Authors: Sheshera Mysore, Tim O'Gorman, Andrew McCallum, Hamed Zamani
- Abstract summary: We introduce the task of faceted Query by Example.
Users can also specify a finer-grained aspect in addition to the input query document.
We envision models which are able to retrieve scientific papers analogous to a query scientific paper.
- Score: 43.01717754418893
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Query by Example is a well-known information retrieval task in which a
document is chosen by the user as the search query and the goal is to retrieve
relevant documents from a large collection. However, a document often covers
multiple aspects of a topic. To address this scenario, we introduce the task of
faceted Query by Example, in which users can also specify a finer-grained aspect
in addition to the input query document. We focus on the application of this
task in scientific literature search. We envision models which are able to
retrieve scientific papers analogous to a query scientific paper along
specifically chosen rhetorical structure elements as one solution to this
problem. In this work, the rhetorical structure elements, which we refer to as
facets, indicate "background", "method", or "result" aspects of a scientific
paper. We introduce and describe an expert annotated test collection to
evaluate models trained to perform this task. Our test collection consists of a
diverse set of 50 query documents, drawn from computational linguistics and
machine learning venues. We carefully followed the annotation guideline used by
TREC for depth-k pooling (k = 100 or 250) and the resulting data collection
consists of graded relevance scores with high annotation agreement. The data is
freely available for research purposes.
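The faceted QBE task described above can be sketched in a few lines: given a query paper, a chosen facet ("background", "method", or "result"), and a corpus whose documents expose facet-labelled text, rank the corpus by facet-to-facet similarity. This is a minimal illustrative sketch only; the bag-of-words embedding, the `faceted_qbe` function, and the data layout are placeholder assumptions, not the paper's actual model.

```python
from collections import Counter
from math import sqrt

FACETS = {"background", "method", "result"}

def embed(text):
    # Toy bag-of-words vector; a real system would use a trained text encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def faceted_qbe(query_doc, facet, corpus):
    """Rank corpus documents by similarity to the chosen facet of the query.

    query_doc and each corpus entry are assumed to map facet names to the
    text spans labelled with that facet (hypothetical data layout)."""
    assert facet in FACETS
    q_vec = embed(query_doc[facet])
    scored = [(doc_id, cosine(q_vec, embed(doc[facet])))
              for doc_id, doc in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Usage: retrieve papers whose "method" facet resembles the query's method.
corpus = {"p1": {"method": "graph neural network for parsing"},
          "p2": {"method": "rule based parser for parsing"}}
query = {"method": "graph neural network encoder"}
ranking = faceted_qbe(query, "method", corpus)
```

Here `ranking` places the paper sharing method-facet vocabulary with the query first, which is the behaviour a facet-conditioned retriever is evaluated on against the collection's graded relevance judgments.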
Related papers
- BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval [54.54576644403115]
Many complex real-world queries require in-depth reasoning to identify relevant documents.
We introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents.
Our dataset consists of 1,384 real-world queries spanning diverse domains, such as economics, psychology, mathematics, and coding.
arXiv Detail & Related papers (2024-07-16T17:58:27Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the majority of errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- QUEST: A Retrieval Dataset of Entity-Seeking Queries with Implicit Set Operations [36.70770411188946]
QUEST is a dataset of 3357 natural language queries with implicit set operations.
The dataset challenges models to match multiple constraints mentioned in queries with corresponding evidence in documents.
We analyze several modern retrieval systems, finding that they often struggle on such queries.
arXiv Detail & Related papers (2023-05-19T14:19:32Z)
- CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query.
Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
arXiv Detail & Related papers (2022-12-18T15:57:46Z)
- Cross-document Event Coreference Search: Task, Dataset and Modeling [26.36068336169796]
We propose an appealing, and often more applicable, complementary setup for the task: Cross-document Coreference Search.
To support research on this task, we create a corresponding dataset, which is derived from Wikipedia.
We present a novel model that integrates a powerful coreference scoring scheme into the DPR architecture, yielding improved performance.
arXiv Detail & Related papers (2022-10-23T08:21:25Z)
- One-Shot Doc Snippet Detection: Powering Search in Document Beyond Text [12.98328149016239]
We propose MONOMER, a one-shot snippet detection task to find snippets in target documents.
We conduct experiments showing MONOMER outperforms several one-shot template-LM baselines.
We train MONOMER on generated data having many visually similar queries.
arXiv Detail & Related papers (2022-09-12T19:26:32Z)
- Aspect-Oriented Summarization through Query-Focused Extraction [23.62412515574206]
Real users' needs often fall more closely into aspects (broad topics in a dataset the user is interested in) rather than specific queries.
We benchmark extractive query-focused training schemes, and propose a contrastive augmentation approach to train the model.
We evaluate on two aspect-oriented datasets and find this approach yields focused summaries, better than those from a generic summarization system.
arXiv Detail & Related papers (2021-10-15T18:06:21Z)
- Text Summarization with Latent Queries [60.468323530248945]
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms.
Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time.
Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences.