KnowledgeHub: An end-to-end Tool for Assisted Scientific Discovery
- URL: http://arxiv.org/abs/2406.00008v2
- Date: Mon, 17 Jun 2024 10:23:46 GMT
- Title: KnowledgeHub: An end-to-end Tool for Assisted Scientific Discovery
- Authors: Shinnosuke Tanaka, James Barry, Vishnudev Kuruvanthodi, Movina Moses, Maxwell J. Giammona, Nathan Herr, Mohab Elkaref, Geeth De Mel
- Abstract summary: This paper describes the KnowledgeHub tool, a scientific literature Information Extraction (IE) and Question Answering (QA) pipeline.
This is achieved by supporting the ingestion of PDF documents that are converted to text and structured representations.
A browser-based annotation tool enables annotating the contents of the PDF documents according to the ontology.
A knowledge graph is constructed from these entity and relation triples which can be queried to obtain insights from the data.
- Score: 1.6080795642111267
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper describes the KnowledgeHub tool, a scientific literature Information Extraction (IE) and Question Answering (QA) pipeline. This is achieved by supporting the ingestion of PDF documents that are converted to text and structured representations. An ontology can then be constructed where a user defines the types of entities and relationships they want to capture. A browser-based annotation tool enables annotating the contents of the PDF documents according to the ontology. Named Entity Recognition (NER) and Relation Classification (RC) models can be trained on the resulting annotations and can be used to annotate the unannotated portion of the documents. A knowledge graph is constructed from these entity and relation triples which can be queried to obtain insights from the data. Furthermore, we integrate a suite of Large Language Models (LLMs) that can be used for QA and summarisation that is grounded in the included documents via a retrieval component. KnowledgeHub is a unique tool that supports annotation, IE and QA, which gives the user full insight into the knowledge discovery pipeline.
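The knowledge-graph step described in the abstract can be illustrated with a minimal sketch. This is not KnowledgeHub's implementation: the `KnowledgeGraph` class, its `add` and `query` methods, and the example triples are all hypothetical, chosen only to show how entity and relation triples might be stored and queried.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples, standing in for the
# output of NER and RC models. They are invented for illustration.
TRIPLES = [
    ("compound A", "dissolves_in", "solvent B"),
    ("compound A", "used_in", "device C"),
    ("solvent B", "used_in", "device C"),
]


class KnowledgeGraph:
    """Tiny in-memory graph that indexes triples by subject."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, subj, rel, obj):
        self._out[subj].append((rel, obj))

    def query(self, subj=None, rel=None, obj=None):
        """Return all triples matching every field that is not None."""
        subjects = [subj] if subj is not None else list(self._out)
        return [
            (s, r, o)
            for s in subjects
            for (r, o) in self._out.get(s, [])
            if (rel is None or r == rel) and (obj is None or o == obj)
        ]


kg = KnowledgeGraph()
for triple in TRIPLES:
    kg.add(*triple)

# Which entities are used in "device C"?
print(kg.query(rel="used_in", obj="device C"))
```

A production system would more likely rely on a graph database or an RDF store; the dictionary-of-lists layout here is only the simplest structure that supports pattern-style queries.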
Related papers
- Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval [49.42043077545341]
We propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from a knowledge graph (KG).
We leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR).
arXiv Detail & Related papers (2024-10-17T17:03:23Z)
- DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents [4.298545628576284]
We introduce DANIEL (Document Attention Network for Information Extraction and Labelling), a fully end-to-end architecture for handwritten document understanding.
DANIEL performs layout recognition, handwriting recognition, and named entity recognition on full-page documents.
It can simultaneously learn across multiple languages, layouts, and tasks.
arXiv Detail & Related papers (2024-07-12T09:09:56Z)
- Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]
We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
- DocTr: Document Transformer for Structured Information Extraction in Documents [36.1145541816468]
We present a new formulation for structured information extraction from visually rich documents.
It aims to address the limitations of existing IOB tagging or graph-based formulations.
We represent an entity as an anchor word and a bounding box, and represent entity linking as the association between anchor words.
arXiv Detail & Related papers (2023-07-16T02:59:30Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing and comprehending information from digital documents, such as a web page.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- KnowGL: Knowledge Generation and Linking from Text [13.407149206621828]
We propose KnowGL, a tool that allows converting text into structured relational data represented as a set of ABox assertions.
We address this problem as a sequence generation task by leveraging pre-trained sequence-to-sequence language models, e.g. BART.
To showcase the capabilities of our tool, we build a web application consisting of a set of UI widgets that help users to navigate through the semantic data extracted from a given input text.
arXiv Detail & Related papers (2022-10-25T12:12:36Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
- Dependently Typed Knowledge Graphs [4.157595789003928]
We show how standardized semantic web technologies (RDF and its query language SPARQL) can be reproduced in a unified manner with dependent type theory.
In addition to providing the basic functionalities of knowledge graphs, dependent types add expressiveness in encoding both entities and queries.
arXiv Detail & Related papers (2020-03-08T14:04:23Z)
- Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout [5.8530995077744645]
We introduce a new task (named Kleister) with two new datasets.
An NLP system must find the most important information, about various types of entities, in long formal documents.
We propose a Pipeline method as a text-only baseline with different Named Entity Recognition architectures.
arXiv Detail & Related papers (2020-03-04T22:45:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.