The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and
Multi-Purpose Corpus of Patent Applications
- URL: http://arxiv.org/abs/2207.04043v1
- Date: Fri, 8 Jul 2022 17:57:15 GMT
- Title: The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and
Multi-Purpose Corpus of Patent Applications
- Authors: Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke
Kominers, Stuart M. Shieber
- Abstract summary: We introduce the Harvard USPTO Patent Dataset (HUPD).
With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora.
By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks.
- Score: 8.110699646062384
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Innovation is a major driver of economic and social development, and
information about many kinds of innovation is embedded in semi-structured data
from patents and patent applications. Although the impact and novelty of
innovations expressed in patent data are difficult to measure through
traditional means, machine learning (ML) offers a promising set of techniques for evaluating
novelty, summarizing contributions, and embedding semantics. In this paper, we
introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale,
well-structured, and multi-purpose corpus of English-language patent
applications filed with the United States Patent and Trademark Office (USPTO)
between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two
to three times larger than comparable corpora. Unlike previously proposed
patent datasets in NLP, HUPD contains the inventor-submitted versions of patent
applications--not the final versions of granted patents--thereby allowing us to
study patentability at the time of filing using NLP methods for the first time.
It is also novel in its inclusion of rich structured metadata alongside the
text of patent filings: By providing each application's metadata along with all
of its text fields, the dataset enables researchers to perform new sets of NLP
tasks that leverage variation in structured covariates. As a case study on the
types of research HUPD makes possible, we introduce a new task to the NLP
community--namely, binary classification of patent decisions. We additionally
show that the structured metadata provided in the dataset enables us to conduct
explicit studies of concept shifts for this task. Finally, we demonstrate how
HUPD can be used for three additional tasks: multi-class classification of
patent subject areas, language modeling, and summarization.
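As a rough sketch of the filing-time classification task introduced here, the snippet below turns application records into (text, label) pairs for binary patentability prediction. The file layout and field names ("abstract", "claims", "decision", and the ACCEPTED/REJECTED values) are assumptions made for illustration; consult the dataset release for the actual schema.

```python
# A minimal sketch of preparing HUPD-style records for binary patent-decision
# classification. File layout and field names below are assumptions, not the
# documented schema of the released dataset.
import json
from pathlib import Path

def load_binary_examples(data_dir: str):
    """Yield (text, label) pairs, keeping only clear-cut outcomes."""
    for path in Path(data_dir).glob("*.json"):
        app = json.loads(path.read_text())
        decision = app.get("decision")
        if decision not in {"ACCEPTED", "REJECTED"}:
            continue  # skip pending or otherwise unresolved filings
        text = app["abstract"] + "\n\n" + app["claims"]
        yield text, 1 if decision == "ACCEPTED" else 0

if __name__ == "__main__":
    examples = list(load_binary_examples("hupd_2015/"))  # hypothetical directory
    print(f"{len(examples)} labeled applications")
```

Any standard text classifier can then be trained on these pairs, and the structured metadata (e.g., filing year or subject area) supports the concept-shift splits mentioned above.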
Related papers
- PatentEdits: Framing Patent Novelty as Textual Entailment [62.8514393375952]
We introduce the PatentEdits dataset, which contains 105K examples of successful revisions.
We design algorithms to label edits sentence by sentence, then establish how well these edits can be predicted with large language models.
We demonstrate that evaluating textual entailment between cited references and draft sentences is especially effective in predicting which inventive claims remain unchanged or are novel relative to the prior art.
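The entailment check described here can be approximated with an off-the-shelf natural language inference model; the sketch below scores whether a cited-reference sentence entails a draft claim sentence. The checkpoint and example sentences are illustrative choices and do not reproduce the PatentEdits pipeline.

```python
# A hedged sketch of textual entailment between a cited reference and a draft
# claim sentence. The NLI checkpoint is an illustrative choice, not the one
# used in the PatentEdits paper.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

reference = "The prior-art device uses a spring-loaded latch to secure the housing."
draft_claim = "A housing secured by a spring-loaded latch mechanism."

# Paired input: premise ("text") and hypothesis ("text_pair").
result = nli({"text": reference, "text_pair": draft_claim})
print(result)  # e.g. [{'label': 'ENTAILMENT', 'score': ...}]
```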
arXiv Detail & Related papers (2024-11-20T17:23:40Z)
- Pap2Pat: Towards Automated Paper-to-Patent Drafting using Chunk-based Outline-guided Generation [13.242188189150987]
We present PAP2PAT, a new challenging benchmark of 1.8k patent-paper pairs with document outlines.
Our experiments with current open-weight LLMs and outline-guided generation show that they can effectively use information from the paper but struggle with repetitions, likely due to the inherent repetitiveness of patent language.
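As a loose illustration of outline-guided drafting in this setting, the sketch below only assembles a prompt that pairs one outline section with one paper chunk; the prompt wording and the one-section-per-chunk pairing are assumptions for illustration, not the PAP2PAT setup.

```python
# A rough sketch of chunk-based, outline-guided patent drafting. The prompt
# template and pairing scheme are illustrative assumptions only.
def build_drafting_prompt(outline_section: str, paper_chunk: str) -> str:
    return (
        "You are drafting a patent application.\n"
        f"Write the section titled '{outline_section}' in formal patent style, "
        "using only the technical content below.\n\n"
        f"Paper excerpt:\n{paper_chunk}\n"
    )

prompt = build_drafting_prompt(
    "Detailed Description of the Embodiments",
    "We propose a transformer-based encoder that compresses sensor streams...",
)
print(prompt)  # pass to any open-weight, instruction-tuned LLM
```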
arXiv Detail & Related papers (2024-10-09T15:52:48Z)
- Natural Language Processing in Patents: A Survey [0.0]
Patents, encapsulating crucial technical and legal information, present a rich domain for natural language processing (NLP) applications.
As NLP technologies evolve, large language models (LLMs) have demonstrated outstanding capabilities in general text processing and generation tasks.
This paper aims to equip NLP researchers with the essential knowledge to navigate this complex domain efficiently.
arXiv Detail & Related papers (2024-03-06T23:17:16Z)
- Differentially Private Synthetic Data via Foundation Model APIs 2: Text [56.13240830670327]
Much of the high-quality text data generated in the real world is private and cannot be shared or used freely due to privacy concerns.
We propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text.
Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines.
arXiv Detail & Related papers (2024-03-04T05:57:50Z)
- PaECTER: Patent-level Representation Learning using Citation-informed Transformers [0.16785092703248325]
PaECTER is a publicly available, open-source document-level encoder specific to patents.
We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents.
PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain.
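A document-level patent encoder of this kind is typically used for similarity search; the sketch below embeds two patent abstracts and computes their cosine similarity. The checkpoint identifier is an assumption about where the released model is hosted, so substitute the name from the authors' repository.

```python
# A minimal sketch of patent-to-patent similarity with a document-level
# encoder such as PaECTER. The checkpoint name is an assumed identifier.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mpi-inno-comp/paecter")  # assumed hub identifier

patent_a = "A lithium-ion battery electrode comprising a silicon-carbon composite."
patent_b = "An anode material formed from silicon nanoparticles embedded in graphite."

emb = model.encode([patent_a, patent_b], normalize_embeddings=True)
print(float(util.cos_sim(emb[0], emb[1])))  # higher score = closer prior art
```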
arXiv Detail & Related papers (2024-02-29T18:09:03Z)
- Leveraging Large Language Models to Improve REST API Testing [51.284096009803406]
RESTGPT takes as input an API specification, extracts machine-interpretable rules, and generates example parameter values from natural-language descriptions in the specification.
Our evaluations indicate that RESTGPT outperforms existing techniques in both rule extraction and value generation.
arXiv Detail & Related papers (2023-12-01T19:53:23Z)
- Unveiling Black-boxes: Explainable Deep Learning Models for Patent Classification [48.5140223214582]
State-of-the-art methods for multi-label patent classification rely on deep, opaque neural networks (DNNs).
We propose a novel explainable deep patent classification framework by introducing layer-wise relevance propagation (LRP).
Using the relevance scores, we then generate explanations by visualizing the words most relevant to the predicted patent class.
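Full layer-wise relevance propagation needs model-specific propagation rules, so the sketch below uses a simpler gradient-times-input attribution as a stand-in to surface the tokens most relevant to the predicted class; it illustrates the idea of word-level relevance rather than the paper's actual LRP framework.

```python
# Gradient-x-input token attribution as a simplified stand-in for LRP-based
# word relevance. Model name and label count are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased"  # placeholder; a fine-tuned patent classifier would go here
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.eval()

text = "A method for wireless charging of implantable medical devices."
enc = tok(text, return_tensors="pt")

# Work on the embedding layer's output so token-level gradients are available.
embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
embeds.requires_grad_(True)

logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
pred = int(logits.argmax(dim=-1))
logits[0, pred].backward()

# Token relevance = gradient x input, summed over the embedding dimension.
relevance = (embeds.grad * embeds).sum(-1).squeeze(0)
for token, score in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), relevance.tolist()):
    print(f"{token:>12}  {score:+.4f}")
```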
arXiv Detail & Related papers (2023-10-31T14:11:37Z)
- Adaptive Taxonomy Learning and Historical Patterns Modelling for Patent Classification [26.85734804493925]
We propose an integrated framework that comprehensively considers patent information for patent classification.
We first present an IPC code correlation learning module to derive semantic representations of the codes.
Finally, we combine the contextual information of patent texts, which captures the semantics of the IPC codes, with assignees' sequential preferences to make predictions.
arXiv Detail & Related papers (2023-08-10T07:02:24Z)
- Patent Sentiment Analysis to Highlight Patent Paragraphs [0.0]
Given a patent document, identifying distinct semantic annotations is an interesting research problem.
In manual patent analysis, practitioners commonly mark paragraphs to flag semantic information and improve readability.
This work assists patent practitioners by highlighting semantic information automatically, using machine learning to support sustainable and efficient patent analysis.
arXiv Detail & Related papers (2021-11-06T13:28:29Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Determinantal Point Processes in Randomized Numerical Linear Algebra [80.27102478796613]
Randomized Numerical Linear Algebra (RandNLA) uses randomness to develop improved algorithms for matrix problems that arise in scientific computing, data science, machine learning, and related fields.
Recent work has uncovered deep and fruitful connections between DPPs and RandNLA, which lead to new guarantees and improved algorithms.
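For a concrete flavor of the RandNLA techniques surveyed here, the sketch below approximates an overdetermined least-squares solve by sampling rows with leverage-score probabilities; it is a generic textbook example, not a specific algorithm from this paper.

```python
# Leverage-score row sampling for approximate least squares, a standard
# RandNLA technique shown purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 20, 400                 # tall matrix, sample size k << n
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Leverage scores are the squared row norms of Q from a thin QR of A.
Q, _ = np.linalg.qr(A)
lev = np.sum(Q**2, axis=1)
p = lev / lev.sum()

idx = rng.choice(n, size=k, replace=True, p=p)
w = 1.0 / np.sqrt(k * p[idx])            # rescaling keeps the sketch unbiased
x_sketch, *_ = np.linalg.lstsq(A[idx] * w[:, None], b[idx] * w, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_sketch - x_exact))  # small error despite using k of n rows
```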
arXiv Detail & Related papers (2020-05-07T00:39:52Z)