Lessons from Deep Learning applied to Scholarly Information Extraction:
What Works, What Doesn't, and Future Directions
- URL: http://arxiv.org/abs/2207.04029v1
- Date: Fri, 8 Jul 2022 17:37:56 GMT
- Authors: Raquib Bin Yousuf, Subhodip Biswas, Kulendra Kumar Kaushal, James
Dunham, Rebecca Gelles, Sathappan Muthiah, Nathan Self, Patrick Butler, Naren
Ramakrishnan
- Abstract summary: We show how EneRex can extract key insights from a large-scale dataset in the domain of computer science.
We highlight how the existing datasets are limited in their capacity and how EneRex may fit into an existing knowledge graph.
- Score: 12.62863659147376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding key insights from full-text scholarly articles is essential as
it enables us to determine interesting trends, give insight into the research
and development, and build knowledge graphs. However, some of the interesting
key insights are only available when considering full-text. Although
researchers have made significant progress in information extraction from short
documents, extraction of scientific entities from full-text scholarly
literature remains a challenging problem. This work presents an automated
End-to-end Research Entity Extractor called EneRex to extract technical facets
such as dataset usage, objective task, and method from full-text scholarly research
articles. Additionally, we extract three novel facets, namely links to source
code, computing resources, and programming language/libraries, from full-text
articles. We demonstrate how EneRex is able to extract key insights and trends
from a large-scale dataset in the domain of computer science. We further test
our pipeline on multiple datasets and find that EneRex improves upon a
state-of-the-art model. We highlight how the existing datasets are limited in
their capacity and how EneRex may fit into an existing knowledge graph. We also
present a detailed discussion with pointers for future research. Our code and
data are publicly available at
https://github.com/DiscoveryAnalyticsCenter/EneRex.
Related papers
- MatViX: Multimodal Information Extraction from Visually Rich Articles [6.349779979863784]
In materials science, extracting structured information from research articles can accelerate the discovery of new materials.
We introduce MatViX, a benchmark consisting of 324 full-length research articles and 1,688 complex structured files.
These files are extracted from text, tables, and figures in full-length documents, providing a comprehensive challenge for MIE.
arXiv Detail & Related papers (2024-10-27T16:13:58Z)
- Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation.
We present an extensive overview by categorizing these works in terms of various IE subtasks and techniques.
We empirically analyze the most advanced methods and discover the emerging trend of IE tasks with LLMs.
arXiv Detail & Related papers (2023-12-29T14:25:22Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science [4.120803087965204]
This paper presents a scalable platform, KGLiDS, that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections.
Based on this information, KGLiDS enables various downstream applications, such as data discovery and pipeline automation.
arXiv Detail & Related papers (2023-03-03T20:31:04Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- MORTY: Structured Summarization for Targeted Information Extraction from Scholarly Articles [0.0]
We present MORTY, an information extraction technique that creates structured summaries of text from scholarly articles.
Our approach condenses the article's full-text to property-value pairs as a segmented text snippet called structured summary.
We also present a sizable scholarly dataset combining structured summaries retrieved from a scholarly knowledge graph and corresponding publicly available scientific articles.
arXiv Detail & Related papers (2022-12-11T06:49:29Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- How to Train Your Agent to Read and Write [52.24605794920856]
Reading and writing research papers is one of the most privileged abilities that a qualified researcher should master.
It would be fascinating if we could train an intelligent agent to help people read and summarize papers, and perhaps even discover and exploit the potential knowledge clues to write novel papers.
We propose a Deep ReAder-Writer (DRAW) network, which consists of a Reader that can extract knowledge graphs (KGs) from input paragraphs and discover potential knowledge, a graph-to-text Writer that generates a novel paragraph, and a ...
arXiv Detail & Related papers (2021-01-04T12:22:04Z)
- Generating Knowledge Graphs by Employing Natural Language Processing and Machine Learning Techniques within the Scholarly Domain [1.9004296236396943]
We present a new architecture that takes advantage of Natural Language Processing and Machine Learning methods for extracting entities and relationships from research publications.
Within this research work, we tackle the challenge of knowledge extraction by employing several state-of-the-art Natural Language Processing and Text Mining tools.
We generated a scientific knowledge graph including 109,105 triples, extracted from 26,827 abstracts of papers within the Semantic Web domain.
arXiv Detail & Related papers (2020-10-28T08:31:40Z)
- Machine Identification of High Impact Research through Text and Image Analysis [0.4737991126491218]
We present a system to automatically separate papers with a high from those with a low likelihood of gaining citations.
Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
arXiv Detail & Related papers (2020-05-20T19:12:24Z)
- ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge.
We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text.
We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)