Automated patent extraction powers generative modeling in focused
chemical spaces
- URL: http://arxiv.org/abs/2303.08272v3
- Date: Mon, 24 Jul 2023 14:28:11 GMT
- Authors: Akshay Subramanian, Kevin P. Greenman, Alexis Gervaix, Tzuhsiung Yang,
Rafael Gómez-Bombarelli
- Abstract summary: Deep generative models have emerged as an exciting avenue for inverse molecular design.
One of the key challenges in their applicability to materials science and chemistry has been the lack of access to sizeable training datasets with property labels.
We develop an automated pipeline to go from patent digital files to the generation of novel candidates with minimal human intervention.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep generative models have emerged as an exciting avenue for inverse
molecular design, with progress coming from the interplay between training
algorithms and molecular representations. One of the key challenges in their
applicability to materials science and chemistry has been the lack of access to
sizeable training datasets with property labels. Published patents contain the
first disclosure of new materials prior to their publication in journals, and
are a vast source of scientific knowledge that has remained relatively untapped
in the field of data-driven molecular design. Because patents are filed seeking
to protect specific uses, molecules in patents can be considered to be weakly
labeled into application classes. Furthermore, patents published by the US
Patent and Trademark Office (USPTO) are downloadable and have machine-readable
text and molecular structures. In this work, we train domain-specific
generative models using patent data sources by developing an automated pipeline
to go from USPTO patent digital files to the generation of novel candidates
with minimal human intervention. We test the approach on two in-class extracted
datasets, one in organic electronics and another in tyrosine kinase inhibitors.
We then evaluate the ability of generative models trained on these in-class
datasets on two categories of tasks (distribution learning and property
optimization), identify strengths and limitations, and suggest possible
explanations and remedies that could be used to overcome these in practice.
Related papers
- Intelligent System for Automated Molecular Patent Infringement Assessment [38.48937966447085]
PatentFinder is a novel multi-agent and tool-enhanced intelligence system that can accurately and comprehensively evaluate small molecules for patent infringement.
PatentFinder features five specialized agents that collaboratively analyze patent claims and molecular structures.
PatentFinder autonomously generates detailed and interpretable patent infringement reports, showcasing enhanced accuracy and improved interpretability.
arXiv Detail & Related papers (2024-12-10T12:14:38Z)
- Pap2Pat: Towards Automated Paper-to-Patent Drafting using Chunk-based Outline-guided Generation [13.242188189150987]
We present PAP2PAT, a new challenging benchmark of 1.8k patent-paper pairs with document outlines.
Our experiments with current open-weight LLMs and outline-guided generation show that they can effectively use information from the paper but struggle with repetitions, likely due to the inherent repetitiveness of patent language.
arXiv Detail & Related papers (2024-10-09T15:52:48Z)
- ClaimCompare: A Data Pipeline for Evaluation of Novelty Destroying Patent Pairs [2.60235825984014]
We introduce a novel data pipeline, ClaimCompare, designed to generate labeled patent claim datasets suitable for training IR and ML models.
To the best of our knowledge, ClaimCompare is the first pipeline that can generate multiple novelty destroying patent datasets.
arXiv Detail & Related papers (2024-07-16T21:38:45Z)
- Automated Neural Patent Landscaping in the Small Data Regime [6.284464997330885]
The rapid expansion of patenting activity in recent decades has driven an increasing need for efficient and effective automated patent landscaping approaches.
We present an automated neural patent landscaping system that demonstrates significantly improved performance on difficult examples.
arXiv Detail & Related papers (2024-07-10T19:13:37Z)
- Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224]
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method.
HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution.
Compared to the conventional textual inversion method in the image domain using a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
arXiv Detail & Related papers (2024-05-05T08:35:23Z)
- Unveiling Black-boxes: Explainable Deep Learning Models for Patent Classification [48.5140223214582]
State-of-the-art methods for multi-label patent classification rely on deep opaque neural networks (DNNs).
We propose a novel deep explainable patent classification framework by introducing layer-wise relevance propagation (LRP).
Considering the relevance score, we then generate explanations by visualizing relevant words for the predicted patent class.
arXiv Detail & Related papers (2023-10-31T14:11:37Z)
- Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
- Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
arXiv Detail & Related papers (2022-08-23T17:01:16Z)
- The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications [8.110699646062384]
We introduce the Harvard USPTO Patent dataset (HUPD).
With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora.
By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks.
arXiv Detail & Related papers (2022-07-08T17:57:15Z)
- MONAI Label: A framework for AI-assisted Interactive Labeling of 3D Medical Images [49.664220687980006]
The lack of annotated datasets is a major bottleneck for training new task-specific supervised machine learning models.
We present MONAI Label, a free and open-source framework that facilitates the development of applications based on artificial intelligence (AI) models.
arXiv Detail & Related papers (2022-03-23T12:33:11Z)