Related papers: Named entity recognition in chemical patents using ensemble of contextual language models

Named entity recognition in chemical patents using ensemble of contextual language models

URL: http://arxiv.org/abs/2007.12569v2
Date: Thu, 17 Sep 2020 09:54:53 GMT
Title: Named entity recognition in chemical patents using ensemble of contextual language models
Authors: Jenny Copara and Nona Naderi and Julien Knafou and Patrick Ruch and Douglas Teodoro
Abstract summary: We study the effectiveness of contextualized language models to extract information from chemical patents. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%.
Score: 0.3731111830152912
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chemical patent documents describe a broad range of applications holding key reaction and compound information, such as chemical structure, reaction formulas, and molecular properties. These informational entities should be first identified in text passages to be utilized in downstream tasks. Text mining provides means to extract relevant information from chemical patents through information extraction techniques. As part of the Information Extraction task of the Cheminformatics Elsevier Melbourne University challenge, in this work we study the effectiveness of contextualized language models to extract reaction information in chemical patents. We assess transformer architectures trained on a generic and specialised corpora to propose a new ensemble model. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%. The results show that ensemble of contextualized language models can provide an effective method to extract information from chemical patents.

Related papers

AgentCAT: An LLM Agent for Extracting and Analyzing Catalytic Reaction Data from Chemical Engineering Literature [55.66036140125613]
This paper presents a large language model (LLM) agent named AgentCAT, which extracts and analyzes catalytic reaction data from chemical engineering papers.<n>AgentCAT serves as an alternative to overcome the long-standing data bottleneck in chemical engineering field.
arXiv Detail & Related papers (2026-02-10T04:30:11Z)
RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning [51.393018266721576]
We propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP)<n>Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem.<n>We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image.
arXiv Detail & Related papers (2025-11-04T09:08:44Z)
A Multi-Agent System Enables Versatile Information Extraction from the Chemical Literature [8.306442315850878]
We develop a multimodal large language model (MLLM)-based multi-agent system for robust and automated chemical information extraction.<n>Our system achieved an F1 score of 80.8% on a benchmark dataset of sophisticated multimodal chemical reaction graphics from the literature.
arXiv Detail & Related papers (2025-07-27T11:16:57Z)
ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences.<n>This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input.<n>Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z)
Chemical knowledge-informed framework for privacy-aware retrosynthesis learning [60.93245342663455]
Current machine learning-based retrosynthesis gathers reaction data from multiple sources into one single edge to train prediction models. This paradigm poses considerable privacy risks as it necessitates broad data availability across organizational boundaries. In the present study, we introduce the chemical knowledge-informed framework (CKIF), a privacy-preserving approach for learning retrosynthesis models.
arXiv Detail & Related papers (2025-02-26T13:13:24Z)
MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild [23.78185449646608]
We present Mol, a novel end-to-end optical chemical structure recognition method. We use a SMILES encoding rule to annotate Mol-7M, the largest annotated molecular image dataset. We trained an end-to-end molecular image captioning model, Mol, using a curriculum learning approach.
arXiv Detail & Related papers (2024-11-17T15:00:09Z)
BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction [65.93303145891628]
BatGPT-Chem is a large language model with 15 billion parameters, tailored for enhanced retrosynthesis prediction. Our model captures a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions. This development empowers chemists to adeptly address novel compounds, potentially expediting the innovation cycle in drug manufacturing and materials science.
arXiv Detail & Related papers (2024-08-19T05:17:40Z)
EnzChemRED, a rich enzyme chemistry relation extraction dataset [3.6124226106001]
EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated. We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text.
arXiv Detail & Related papers (2024-04-22T14:18:34Z)
OpenChemIE: An Information Extraction Toolkit For Chemistry Literature [37.23189665773341]
OpenChemIE is a tool for extracting reaction data from chemistry literature. We employ specialized neural models that address a specific task for chemistry information extraction. We meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole.
arXiv Detail & Related papers (2024-04-01T20:16:21Z)
An Autonomous Large Language Model Agent for Chemical Literature Data Mining [60.85177362167166]
We introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature. Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data.
arXiv Detail & Related papers (2024-02-20T13:21:46Z)
ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision [27.850325653751078]
structured chemical reaction information plays a vital role for chemists engaged in laboratory work and advanced endeavors such as computer-aided drug design. Despite the importance of extracting structured reactions from scientific literature, data annotation for this purpose is cost-prohibitive due to the significant labor required from domain experts. We propose ReactIE, which combines two weakly supervised approaches for pre-training. Our method utilizes frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions.
arXiv Detail & Related papers (2023-07-04T02:52:30Z)
Stress Testing BERT Anaphora Resolution Models for Reaction Extraction in Chemical Patents [7.653466578233261]
In chemical patents, there are five anaphoric relations of interest: co-reference, transformed, reaction associated, work up, and contained. Our goal is to investigate how the performance of anaphora resolution models for reaction texts differs in a noise-free and noisy environment.
arXiv Detail & Related papers (2023-06-23T09:01:56Z)
Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose the conversational molecular design, a novel task adopting natural language for describing and editing target molecules. To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z)
ChemVise: Maximizing Out-of-Distribution Chemical Detection with the Novel Application of Zero-Shot Learning [60.02503434201552]
This research proposes learning approximations of complex exposures from training sets of simple ones. We demonstrate this approach to synthetic sensor responses surprisingly improves the detection of out-of-distribution obscured chemical analytes.
arXiv Detail & Related papers (2023-02-09T20:19:57Z)
Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets. Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.