CO-Search: COVID-19 Information Retrieval with Semantic Search, Question
Answering, and Abstractive Summarization
- URL: http://arxiv.org/abs/2006.09595v1
- Date: Wed, 17 Jun 2020 01:32:48 GMT
- Title: CO-Search: COVID-19 Information Retrieval with Semantic Search, Question
Answering, and Abstractive Summarization
- Authors: Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng
Yin, Dragomir Radev, Richard Socher
- Abstract summary: CO-Search is a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature.
To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations.
We evaluate our system on the data of the TREC-COVID information retrieval challenge.
- Score: 53.67205506042232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The COVID-19 global pandemic has resulted in international efforts to
understand, track, and mitigate the disease, yielding a significant corpus of
COVID-19 and SARS-CoV-2-related publications across scientific disciplines. As
of May 2020, 128,000 coronavirus-related publications have been collected
through the COVID-19 Open Research Dataset Challenge. Here we present
CO-Search, a retriever-ranker semantic search engine designed to handle complex
queries over the COVID-19 literature, potentially aiding overburdened health
workers in finding scientific answers during a time of crisis. The retriever is
built from a Siamese-BERT encoder that is linearly composed with a TF-IDF
vectorizer, and reciprocal-rank fused with a BM25 vectorizer. The ranker is
composed of a multi-hop question-answering module that, together with a
multi-paragraph abstractive summarizer, adjusts retriever scores. To account for
the domain-specific and relatively limited dataset, we generate a bipartite
graph of document paragraphs and citations, creating 1.3 million (citation
title, paragraph) tuples for training the encoder. We evaluate our system on
the data of the TREC-COVID information retrieval challenge. CO-Search obtains
top performance on the datasets of the first and second rounds, across several
key metrics: normalized discounted cumulative gain, precision, mean average
precision, and binary preference.
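The abstract describes the retriever as a linear composition of Siamese-BERT and TF-IDF scores, fused by reciprocal rank with BM25. The following is a minimal sketch of those two combination steps only, not the authors' implementation. The function names, the mixing weight of 0.5, the smoothing constant k = 60 (a commonly used default for reciprocal rank fusion), and the toy scores are all assumptions made for illustration.

```python
from typing import Dict, List


def linear_combine(dense: Dict[str, float],
                   sparse: Dict[str, float],
                   weight: float = 0.5) -> Dict[str, float]:
    """Mix two score maps: weight * dense + (1 - weight) * sparse.

    `weight` is a hypothetical mixing coefficient; the abstract does not
    state the value used in CO-Search.
    """
    docs = set(dense) | set(sparse)
    return {
        d: weight * dense.get(d, 0.0) + (1.0 - weight) * sparse.get(d, 0.0)
        for d in docs
    }


def to_ranking(scores: Dict[str, float]) -> List[str]:
    """Sort document ids by descending score, breaking ties by id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists by summing 1 / (k + rank) for each document."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return to_ranking(fused)


# Toy scores for three documents (illustrative values, not real data).
dense_scores = {"a": 0.9, "b": 0.2, "c": 0.5}   # stand-in for encoder similarity
tfidf_scores = {"a": 0.1, "b": 0.6, "c": 0.7}   # stand-in for TF-IDF similarity
bm25_ranking = ["b", "c", "a"]                  # stand-in for a BM25 ranking

combined_ranking = to_ranking(linear_combine(dense_scores, tfidf_scores))
final_ranking = reciprocal_rank_fusion([combined_ranking, bm25_ranking])
print(combined_ranking, final_ranking)
```

Reciprocal rank fusion uses only rank positions, so the BM25 scores never need to be normalized onto the same scale as the encoder and TF-IDF scores.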
Related papers
- $\texttt{MixGR}$: Enhancing Retriever Generalization for Scientific Domain through Complementary Granularity [88.78750571970232]
This paper introduces $\texttt{MixGR}$, which improves dense retrievers' awareness of query-document matching.
$\texttt{MixGR}$ fuses various metrics based on granularities to a united score that reflects a comprehensive query-document similarity.
arXiv Detail & Related papers (2024-07-15T13:04:09Z)
- Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration [60.535793237063885]
The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet.
The impact of this surge in AIGC on Information Retrieval systems remains an open question.
We introduce Cocktail, a benchmark tailored for evaluating IR models in this mixed-sourced data landscape.
arXiv Detail & Related papers (2024-05-26T12:30:20Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- COVID-19 Literature Mining and Retrieval using Text Mining Approaches [0.0]
The novel coronavirus disease (COVID-19) began in Wuhan, China, in late 2019 and to date has infected over 148M people worldwide.
Many academicians and researchers started to publish papers describing the latest discoveries on COVID-19.
The proposed model attempts to extract relevant titles from the large corpus of research publications.
arXiv Detail & Related papers (2022-05-29T22:34:19Z)
- Unsupervised Text Mining of COVID-19 Records [0.0]
Twitter is a powerful tool that can help researchers measure public health in response to COVID-19.
This paper preprocessed the existing medical dataset regarding COVID-19 named CORD-19 and annotated the dataset for supervised classification tasks.
arXiv Detail & Related papers (2021-09-08T05:57:22Z)
- COVID-19 Multidimensional Kaggle Literature Organization [3.201839066679614]
We show that factorization is a powerful unsupervised learning method capable of discovering hidden patterns in a document corpus.
We show that a higher-order representation of the corpus allows for the simultaneous grouping of similar articles, relevant journals, authors with similar research interests, and topic keywords.
arXiv Detail & Related papers (2021-07-17T06:16:36Z)
- Multistage BiCross Encoder: Team GATE Entry for MLIA Multilingual Semantic Search Task 2 [6.229830820553111]
We present a search system called Multistage BiCross, developed by team GATE for the MLIA task 2 Multilingual Semantic Search.
The results of round 1 show that our models achieve state-of-the-art performance for all ranking metrics for both monolingual and bilingual runs.
arXiv Detail & Related papers (2021-01-08T13:59:26Z)
- Repurposing TREC-COVID Annotations to Answer the Key Questions of CORD-19 [4.847073702809032]
The coronavirus disease 2019 (COVID-19) began in Wuhan, China in late 2019 and to date has infected over 14M people worldwide.
The White House aggregated over 200,000 journal articles related to a variety of coronaviruses and tasked the community with answering key questions related to the corpus.
We set out to repurpose the relevancy annotations for TREC-COVID tasks to identify journal articles in CORD-19 which are relevant to the key questions posed by CORD-19.
arXiv Detail & Related papers (2020-08-27T19:51:07Z)
- CAiRE-COVID: A Question Answering and Query-focused Multi-Document Summarization System for COVID-19 Scholarly Information Management [48.251211691263514]
We present CAiRE-COVID, a real-time question answering (QA) and multi-document summarization system, which won one of the 10 tasks in the Kaggle COVID-19 Open Research Dataset Challenge.
Our system aims to tackle the recent challenge of mining the numerous scientific articles being published on COVID-19 by answering high priority questions from the community.
arXiv Detail & Related papers (2020-05-04T15:07:27Z)
- Rapidly Bootstrapping a Question Answering Dataset for COVID-19 [88.86456834766288]
We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19.
This is the first publicly available resource of its type, and intended as a stopgap measure for guiding research until more substantial evaluation resources become available.
arXiv Detail & Related papers (2020-04-23T17:35:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.