AutoIE: An Automated Framework for Information Extraction from
Scientific Literature
- URL: http://arxiv.org/abs/2401.16672v1
- Date: Tue, 30 Jan 2024 01:45:03 GMT
- Title: AutoIE: An Automated Framework for Information Extraction from
Scientific Literature
- Authors: Yangyang Liu, Shoubin Li
- Abstract summary: AutoIE is a framework designed to automate the extraction of vital data from scientific PDF documents.
Our SBERT model achieves high Marco F1 scores of 87.19 and 89.65 on CoNLL04 and ADE datasets.
This research paves the way for enhanced data management and interpretation in molecular sieve synthesis.
- Score: 6.235887933544583
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the rapidly evolving field of scientific research, efficiently extracting
key information from the burgeoning volume of scientific papers remains a
formidable challenge. This paper introduces an innovative framework designed to
automate the extraction of vital data from scientific PDF documents, enabling
researchers to discern future research trajectories more readily. AutoIE
uniquely integrates four novel components: (1) A multi-semantic feature
fusion-based approach for PDF document layout analysis; (2) Advanced functional
block recognition in scientific texts; (3) A synergistic technique for
extracting and correlating information on molecular sieve synthesis; (4) An
online learning paradigm tailored for molecular sieve literature. Our SBERT
model achieves high Marco F1 scores of 87.19 and 89.65 on CoNLL04 and ADE
datasets. In addition, a practical application of AutoIE in the petrochemical
molecular sieve synthesis domain demonstrates its efficacy, evidenced by an
impressive 78\% accuracy rate. This research paves the way for enhanced data
management and interpretation in molecular sieve synthesis. It is a valuable
asset for seasoned experts and newcomers in this specialized field.
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of ScientificAspects.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z) - An Autonomous Large Language Model Agent for Chemical Literature Data
Mining [60.85177362167166]
We introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature.
Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data.
arXiv Detail & Related papers (2024-02-20T13:21:46Z) - CARE: Extracting Experimental Findings From Clinical Literature [29.763929941107616]
This work presents CARE, a new IE dataset for the task of extracting clinical findings.
We develop a new annotation schema capturing fine-grained findings as n-ary relations between entities and attributes.
We collect extensive annotations for 700 abstracts from two sources: clinical trials and case reports.
arXiv Detail & Related papers (2023-11-16T10:06:19Z) - PathLDM: Text conditioned Latent Diffusion Model for Histopathology [62.970593674481414]
We introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images.
Our approach fuses image and textual data to enhance the generation process.
We achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
arXiv Detail & Related papers (2023-09-01T22:08:32Z) - PcMSP: A Dataset for Scientific Action Graphs Extraction from
Polycrystalline Materials Synthesis Procedure Text [1.9573380763700712]
This dataset simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations.
A two-step human annotation and inter-annotator agreement study guarantee the high quality of the PcMSP corpus.
We introduce four natural language processing tasks: sentence classification, named entity recognition, relation classification, and joint extraction of entities and relations.
arXiv Detail & Related papers (2022-10-22T09:43:54Z) - Text to Insight: Accelerating Organic Materials Knowledge Extraction via
Deep Learning [1.2774526936067927]
This study aims to explore knowledge extraction for organic materials.
We built a research dataset composed of 855 annotated and 708,376 unannotated sentences drawn from 92,667 abstracts.
We used named-entity-recognition (NER) with BiLSTM-CNN-CRF deep learning model to automatically extract key knowledge from literature.
arXiv Detail & Related papers (2021-09-27T01:58:35Z) - CitationIE: Leveraging the Citation Graph for Scientific Information
Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.