Large-Scale Knowledge Synthesis and Complex Information Retrieval from
Biomedical Documents
- URL: http://arxiv.org/abs/2302.06854v1
- Date: Tue, 14 Feb 2023 06:03:43 GMT
- Title: Large-Scale Knowledge Synthesis and Complex Information Retrieval from
Biomedical Documents
- Authors: Shreya Saxena, Raj Sangani, Siva Prasad, Shubham Kumar, Mihir Athale,
Rohan Awhad, Vishal Vaddina
- Abstract summary: Recent advances in the healthcare industry have led to an abundance of unstructured data.
Our work offers an all-in-one scalable solution for extracting and exploring complex information from large-scale research documents.
- Score: 0.33249867230903685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in the healthcare industry have led to an abundance of
unstructured data, making it challenging to perform tasks such as efficient and
accurate information retrieval at scale. Our work offers an all-in-one scalable
solution for extracting and exploring complex information from large-scale
research documents, which would otherwise be tedious. First, we briefly explain
our knowledge synthesis process to extract helpful information from
unstructured text data of research documents. Then, on top of the knowledge
extracted from the documents, we perform complex information retrieval using
three major components: Paragraph Retrieval, Triplet Retrieval from Knowledge
Graphs, and Complex Question Answering (QA). These components combine lexical
and semantic-based methods to retrieve paragraphs and triplets and perform
faceted refinement for filtering these search results. The complexity of
biomedical queries and documents necessitates using a QA system capable of
handling queries more complex than factoid queries, which we evaluate
qualitatively on the COVID-19 Open Research Dataset (CORD-19) to demonstrate
the effectiveness and value-add.
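The abstract describes combining lexical and semantic methods for paragraph retrieval, followed by faceted refinement of the results, but it does not give a scoring formula. The following is a minimal sketch of one way such a combination could look, not the authors' implementation. Everything named here is an assumption for illustration: the `hybrid_rank` function, the `alpha` mixing weight, the TF-IDF-style lexical score, the `facets` dictionary layout, and `toy_embed`, which is a bag-of-words placeholder standing in for a real sentence encoder.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def lexical_scores(query, texts):
    """TF-IDF-weighted term overlap, scaled so the best paragraph scores 1.0."""
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(texts)
    raw = []
    for c in counts:
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for other in counts if term in other)
            if df:
                score += c[term] * math.log(1 + n / df)
        raw.append(score)
    top = max(raw, default=0.0)
    return [s / top if top else 0.0 for s in raw]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_rank(query, paragraphs, embed, alpha=0.5, facets=None):
    """Filter by facets, then rank by alpha * lexical + (1 - alpha) * semantic."""
    facets = facets or {}
    kept = [p for p in paragraphs
            if all(p["facets"].get(k) == v for k, v in facets.items())]
    lex = lexical_scores(query, [p["text"] for p in kept])
    q_vec = embed(query)
    scored = []
    for p, lex_score in zip(kept, lex):
        sem_score = cosine(q_vec, embed(p["text"]))
        scored.append((alpha * lex_score + (1 - alpha) * sem_score, p))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical placeholder encoder: counts of a few fixed terms.
# A real system would use a trained sentence-embedding model here.
TOY_VOCAB = ["remdesivir", "covid-19", "masks", "transmission"]


def toy_embed(text):
    counts = Counter(tokenize(text))
    return [float(counts[w]) for w in TOY_VOCAB]


# Made-up example paragraphs, not taken from CORD-19.
PARAGRAPHS = [
    {"text": "remdesivir reduces recovery time in covid-19 patients",
     "facets": {"year": 2020}},
    {"text": "masks limit transmission of respiratory viruses",
     "facets": {"year": 2020}},
    {"text": "remdesivir dosing in covid-19 trials",
     "facets": {"year": 2021}},
]

results = hybrid_rank("remdesivir covid-19", PARAGRAPHS, toy_embed,
                      facets={"year": 2020})
for score, paragraph in results:
    print(round(score, 3), paragraph["text"])
```

In this sketch the facet filter runs before scoring, so document-frequency statistics are computed only over the surviving paragraphs. Filtering after scoring is an equally plausible design; the abstract does not say which order the system uses.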
Related papers
- RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs [12.846097618151951]
We develop a dataset for LLMs Complex Reasoning over Textual Knowledge Graphs (RiTeK) with a broad topological structure coverage.
We synthesize realistic user queries that integrate diverse topological structures, annotated information, and complex textual descriptions.
We introduce an enhanced Monte Carlo Tree Search (MCTS) method, which automatically extracts relational path information from textual graphs for specific queries.
arXiv Detail & Related papers (2024-10-17T19:33:37Z)
- STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STaRK, a large-scale semi-structured retrieval benchmark on Textual and Relational Knowledge Bases.
Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine.
We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)
- A Question Answering Based Pipeline for Comprehensive Chinese EHR Information Extraction [3.411065529290054]
We propose a novel approach that automatically generates training data for transfer learning of question answering models.
Our pipeline incorporates a preprocessing module to handle challenges posed by extraction types.
The obtained QA model exhibits excellent performance on subtasks of information extraction in EHRs.
arXiv Detail & Related papers (2024-02-17T02:55:35Z)
- QuOTeS: Query-Oriented Technical Summarization [0.2936007114555107]
We propose QuOTeS, an interactive system designed to retrieve sentences related to a summary of the research from a collection of potential references.
QuOTeS integrates techniques from Query-Focused Extractive Summarization and High-Recall Information Retrieval to provide Interactive Query-Focused Summarization of scientific documents.
The results show that QuOTeS provides a positive user experience and consistently provides query-focused summaries that are relevant, concise, and complete.
arXiv Detail & Related papers (2023-06-20T18:43:24Z)
- Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
- Query-Specific Knowledge Graphs for Complex Finance Topics [6.599344783327053]
We focus on the CODEC dataset, where domain experts create challenging questions.
We show that state-of-the-art ranking systems have headroom for improvement.
We demonstrate that entity and document relevance are positively correlated.
arXiv Detail & Related papers (2022-11-08T10:21:13Z)
- ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- CAiRE-COVID: A Question Answering and Query-focused Multi-Document Summarization System for COVID-19 Scholarly Information Management [48.251211691263514]
We present CAiRE-COVID, a real-time question answering (QA) and multi-document summarization system, which won one of the 10 tasks in the Kaggle COVID-19 Open Research Dataset Challenge.
Our system aims to tackle the recent challenge of mining the numerous scientific articles being published on COVID-19 by answering high-priority questions from the community.
arXiv Detail & Related papers (2020-05-04T15:07:27Z)