Named entity recognition in chemical patents using ensemble of
contextual language models
- URL: http://arxiv.org/abs/2007.12569v2
- Date: Thu, 17 Sep 2020 09:54:53 GMT
- Title: Named entity recognition in chemical patents using ensemble of
contextual language models
- Authors: Jenny Copara and Nona Naderi and Julien Knafou and Patrick Ruch and
Douglas Teodoro
- Abstract summary: We study the effectiveness of contextualized language models to extract information from chemical patents.
Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%.
- Score: 0.3731111830152912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chemical patent documents describe a broad range of applications holding key
reaction and compound information, such as chemical structure, reaction
formulas, and molecular properties. These informational entities should be
first identified in text passages to be utilized in downstream tasks. Text
mining provides means to extract relevant information from chemical patents
through information extraction techniques. As part of the Information
Extraction task of the Cheminformatics Elsevier Melbourne University challenge,
in this work we study the effectiveness of contextualized language models to
extract reaction information in chemical patents. We assess transformer
architectures trained on generic and specialised corpora to propose a new
ensemble model. Our best model, based on a majority ensemble approach, achieves
an exact F1-score of 92.30% and a relaxed F1-score of 96.24%. The results show
that an ensemble of contextualized language models can provide an effective method
to extract information from chemical patents.
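The majority ensemble idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each model emits one BIO-style label per token, that all models share the same tokenization, and that a label is accepted only when more than half of the models agree. The function name `majority_vote`, the fallback label `O`, and the entity labels in the example are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: str = "O") -> List[str]:
    """Combine per-token labels from several models by strict majority vote.

    predictions: one label sequence per model, all over the same tokens.
    fallback: label used when no label is chosen by more than half of the
              models (an assumption of this sketch, not taken from the paper).
    """
    if not predictions:
        return []
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    n_models = len(predictions)
    merged = []
    # zip(*predictions) yields, for each token, the labels from every model.
    for token_labels in zip(*predictions):
        label, count = Counter(token_labels).most_common(1)[0]
        merged.append(label if count > n_models / 2 else fallback)
    return merged


# Hypothetical predictions from three models over four tokens.
model_a = ["B-REAGENT", "I-REAGENT", "O", "B-SOLVENT"]
model_b = ["B-REAGENT", "O", "O", "B-SOLVENT"]
model_c = ["B-REAGENT", "I-REAGENT", "O", "O"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

With three models, a label needs at least two votes to be kept. Voting per token can produce inconsistent BIO sequences (for example an `I-` tag without a preceding `B-` tag), so a real system would likely vote on whole entity spans or repair the tags afterwards.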
Related papers
- Integrating Chemistry Knowledge in Large Language Models via Prompt Engineering [2.140221068402338]
This paper presents a study on the integration of domain-specific knowledge in prompt engineering to enhance the performance of large language models (LLMs) in scientific domains.
A benchmark dataset is curated to cover the intricate physicochemical properties of small molecules, their druggability for pharmacology, and the functional attributes of enzymes and crystal materials.
The proposed domain-knowledge embedded prompt engineering method outperforms traditional prompt engineering strategies on various metrics.
arXiv Detail & Related papers (2024-04-22T16:55:44Z)
- EnzChemRED, a rich enzyme chemistry relation extraction dataset [3.6124226106001]
EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated.
We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text.
We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text.
arXiv Detail & Related papers (2024-04-22T14:18:34Z)
- OpenChemIE: An Information Extraction Toolkit For Chemistry Literature [37.23189665773341]
OpenChemIE is a tool for extracting reaction data from chemistry literature.
We employ specialized neural models that address a specific task for chemistry information extraction.
We meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole.
arXiv Detail & Related papers (2024-04-01T20:16:21Z)
- Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology.
We provide an analysis of recent advancements achieved through cross modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z)
- An Autonomous Large Language Model Agent for Chemical Literature Data Mining [60.85177362167166]
We introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature.
Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data.
arXiv Detail & Related papers (2024-02-20T13:21:46Z)
- ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision [27.850325653751078]
Structured chemical reaction information plays a vital role for chemists engaged in laboratory work and advanced endeavors such as computer-aided drug design.
Despite the importance of extracting structured reactions from scientific literature, data annotation for this purpose is cost-prohibitive due to the significant labor required from domain experts.
We propose ReactIE, which combines two weakly supervised approaches for pre-training. Our method utilizes frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions.
arXiv Detail & Related papers (2023-07-04T02:52:30Z)
- Stress Testing BERT Anaphora Resolution Models for Reaction Extraction in Chemical Patents [7.653466578233261]
In chemical patents, there are five anaphoric relations of interest: co-reference, transformed, reaction associated, work up, and contained.
Our goal is to investigate how the performance of anaphora resolution models for reaction texts differs in a noise-free and noisy environment.
arXiv Detail & Related papers (2023-06-23T09:01:56Z)
- Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose conversational molecular design, a novel task that uses natural language to describe and edit target molecules.
To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z)
- ChemVise: Maximizing Out-of-Distribution Chemical Detection with the Novel Application of Zero-Shot Learning [60.02503434201552]
This research proposes learning approximations of complex exposures from training sets of simple ones.
We demonstrate that applying this approach to synthetic sensor responses surprisingly improves the detection of out-of-distribution obscured chemical analytes.
arXiv Detail & Related papers (2023-02-09T20:19:57Z)
- Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
- Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets.
Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.