Automated patent extraction powers generative modeling in focused
chemical spaces
- URL: http://arxiv.org/abs/2303.08272v3
- Date: Mon, 24 Jul 2023 14:28:11 GMT
- Title: Automated patent extraction powers generative modeling in focused
chemical spaces
- Authors: Akshay Subramanian, Kevin P. Greenman, Alexis Gervaix, Tzuhsiung Yang,
Rafael Gómez-Bombarelli
- Abstract summary: Deep generative models have emerged as an exciting avenue for inverse molecular design.
One of the key challenges in their applicability to materials science and chemistry has been the lack of access to sizeable training datasets with property labels.
We develop an automated pipeline to go from patent digital files to the generation of novel candidates with minimal human intervention.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep generative models have emerged as an exciting avenue for inverse
molecular design, with progress coming from the interplay between training
algorithms and molecular representations. One of the key challenges in their
applicability to materials science and chemistry has been the lack of access to
sizeable training datasets with property labels. Published patents contain the
first disclosure of new materials prior to their publication in journals, and
are a vast source of scientific knowledge that has remained relatively untapped
in the field of data-driven molecular design. Because patents are filed seeking
to protect specific uses, molecules in patents can be considered to be weakly
labeled into application classes. Furthermore, patents published by the US
Patent and Trademark Office (USPTO) are downloadable and have machine-readable
text and molecular structures. In this work, we train domain-specific
generative models using patent data sources by developing an automated pipeline
to go from USPTO patent digital files to the generation of novel candidates
with minimal human intervention. We test the approach on two in-class extracted
datasets, one in organic electronics and another in tyrosine kinase inhibitors.
We then evaluate the ability of generative models trained on these in-class
datasets on two categories of tasks (distribution learning and property
optimization), identify strengths and limitations, and suggest possible
explanations and remedies that could be used to overcome these in practice.
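The weak-labeling idea in the abstract (molecules inherit an application class from the patent that discloses them) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `PatentRecord` type, the `weakly_label` function, the keyword-matching rule, and the patent ids `P1` to `P3` are all hypothetical, and the records stand in for already-parsed USPTO files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatentRecord:
    """Hypothetical, simplified stand-in for one parsed patent file."""
    patent_id: str
    text: str        # free text such as the title or abstract
    smiles: tuple    # molecular structures in the patent, as SMILES strings


def weakly_label(records, keywords):
    """Collect molecules from patents whose text mentions any keyword.

    Returns a dict mapping each SMILES string to the sorted list of
    patent ids it was found in. The label is "weak" because a matching
    patent may also contain off-target molecules such as reagents.
    """
    lowered = [k.lower() for k in keywords]
    labeled = {}
    for record in records:
        text = record.text.lower()
        if any(k in text for k in lowered):
            for smi in record.smiles:
                labeled.setdefault(smi, set()).add(record.patent_id)
    return {smi: sorted(ids) for smi, ids in labeled.items()}


# Toy records (assumed data, for illustration only).
records = [
    PatentRecord("P1", "Organic light emitting diode comprising a host material",
                 ("c1ccccc1", "c1ccc2ccccc2c1")),
    PatentRecord("P2", "Tyrosine kinase inhibitor compounds",
                 ("CC(=O)Nc1ccccc1",)),
    PatentRecord("P3", "Emissive layer for an organic light emitting diode",
                 ("c1ccccc1",)),
]

oled_set = weakly_label(records, ["organic light emitting diode"])
print(oled_set)
```

Keeping the source patent ids alongside each molecule makes it possible to trace a training example back to its disclosure, which helps when a weak label turns out to be wrong.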
Related papers
- Automated Neural Patent Landscaping in the Small Data Regime [6.284464997330885]
The rapid expansion of patenting activity in recent decades has driven an increasing need for efficient and effective automated patent landscaping approaches.
We present an automated neural patent landscaping system that demonstrates significantly improved performance on difficult examples.
arXiv Detail & Related papers (2024-07-10T19:13:37Z)
- Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224]
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method.
HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution.
Compared to the conventional textual inversion method in the image domain using a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
arXiv Detail & Related papers (2024-05-05T08:35:23Z)
- Unveiling Black-boxes: Explainable Deep Learning Models for Patent Classification [48.5140223214582]
State-of-the-art methods for multi-label patent classification rely on deep opaque neural networks (DNNs).
We propose a novel deep explainable patent classification framework by introducing layer-wise relevance propagation (LRP).
Considering the relevance score, we then generate explanations by visualizing relevant words for the predicted patent class.
arXiv Detail & Related papers (2023-10-31T14:11:37Z)
- Graph Representation Learning Towards Patents Network Analysis [2.202803272456695]
This research employed a graph representation learning approach to create, analyze, and find similarities in the patent data registered in the Iranian Official Gazette.
Key entities were extracted from the scraped patents dataset to create the Iranian patents graph from scratch.
Thanks to the utilization of novel graph algorithms and text mining methods, we identified new areas of industry and research from Iranian patent data.
arXiv Detail & Related papers (2023-09-25T05:49:40Z)
- Structure to Property: Chemical Element Embeddings and a Deep Learning Approach for Accurate Prediction of Chemical Properties [0.0]
This paper introduces a new machine learning model based on deep learning techniques, such as a multilayer encoder and decoder architecture, for classification tasks.
We demonstrate the opportunities offered by our approach by applying it to various types of input data, including organic and inorganic compounds.
The models used in this work exhibit a high degree of predictive power, underscoring the progress that can be made with refined machine learning.
arXiv Detail & Related papers (2023-09-17T19:41:32Z)
- Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
- Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
arXiv Detail & Related papers (2022-08-23T17:01:16Z)
- The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications [8.110699646062384]
We introduce the Harvard USPTO Patent Dataset (HUPD).
With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora.
By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks.
arXiv Detail & Related papers (2022-07-08T17:57:15Z)
- A Survey on Sentence Embedding Models Performance for Patent Analysis [0.0]
We propose a standard library and dataset for assessing the accuracy of embedding models based on the PatentSBERTa approach.
Results show PatentSBERTa, Bert-for-patents, and TF-IDF Weighted Word Embeddings have the best accuracy for computing sentence embeddings at the subclass level.
arXiv Detail & Related papers (2022-04-28T12:04:42Z)
- MONAI Label: A framework for AI-assisted Interactive Labeling of 3D Medical Images [49.664220687980006]
The lack of annotated datasets is a major bottleneck for training new task-specific supervised machine learning models.
We present MONAI Label, a free and open-source framework that facilitates the development of applications based on artificial intelligence (AI) models.
arXiv Detail & Related papers (2022-03-23T12:33:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.