Related papers: The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

URL: http://arxiv.org/abs/2207.04043v1
Date: Fri, 8 Jul 2022 17:57:15 GMT
Title: The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications
Authors: Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke Kominers, Stuart M. Shieber
Abstract summary: We introduce the Harvard USPTO Patent dataset (HUPD) With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks.
Score: 8.110699646062384
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, ML offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications--not the final versions of granted patents--thereby allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community--namely, binary classification of patent decisions. We additionally show the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization.

Related papers

Harnessing Large Language Models for Scientific Novelty Detection [49.10608128661251]
We propose to harness large language models (LLMs) for scientific novelty detection (ND)<n>To capture idea conception, we propose to train a lightweight retriever by distilling the idea-level knowledge from LLMs.<n> Experiments show our method consistently outperforms others on the proposed benchmark datasets for idea retrieval and ND tasks.
arXiv Detail & Related papers (2025-05-30T14:08:13Z)
Enriching Patent Claim Generation with European Patent Dataset [2.0270237738043906]
We introduce EPD, a European patent dataset. EPD presents rich textual data and structured metadata to support patent-related tasks, including claim generation.<n>EPD fills a critical gap by providing a benchmark for European patents to enable more comprehensive evaluation.<n>Experiments show that LLMs fine-tuned on EPD significantly outperform those trained on previous datasets and even GPT-4o in claim quality and cross-domain generalization.
arXiv Detail & Related papers (2025-05-18T23:04:49Z)
PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures [7.16446145782558]
We introduce PatentDesc-355K, a novel large-scale dataset containing 355K patent figures along with their brief and detailed textual descriptions. We also propose PatentLMM - a novel multimodal large language model specifically tailored to generate high-quality descriptions of patent figures. Our proposed PatentLMM comprises two key components: (i) PatentMME, a specialized multimodal vision encoder that captures the unique structural elements of patent figures, and (ii) PatentLLaMA, a domain-adapted version of LLaMA fine-tuned on a large collection of patents.
arXiv Detail & Related papers (2025-01-25T04:45:32Z)
EvoPat: A Multi-LLM-based Patents Summarization and Analysis Agent [0.0]
EvoPat is a multi-LLM-based patent agent designed to assist users in analyzing patents through Retrieval-Augmented Generation (RAG) and advanced search strategies. We demonstrate that EvoPat outperforms GPT-4 in tasks such as patent summarization, comparative analysis, and technical evaluation.
arXiv Detail & Related papers (2024-12-24T02:21:09Z)
AutoPatent: A Multi-Agent Framework for Automatic Patent Generation [16.862811929856313]
We introduce a novel and practical task known as Draft2Patent, along with its corresponding D2P benchmark, which challenges Large Language Models to generate full-length patents averaging 17K tokens based on initial drafts. We propose a multi-agent framework called AutoPatent which leverages the LLM-based planner agent, writer agents, and examiner agent with PGTree and RRAG to generate lengthy, intricate, and high-quality complete patent documents.
arXiv Detail & Related papers (2024-12-13T02:27:34Z)
PatentEdits: Framing Patent Novelty as Textual Entailment [62.8514393375952]
We introduce the PatentEdits dataset, which contains 105K examples of successful revisions. We design algorithms to label edits sentence by sentence, then establish how well these edits can be predicted with large language models. We demonstrate that evaluating textual entailment between cited references and draft sentences is especially effective in predicting which inventive claims remained unchanged or are novel in relation to prior art.
arXiv Detail & Related papers (2024-11-20T17:23:40Z)
Pap2Pat: Towards Automated Paper-to-Patent Drafting using Chunk-based Outline-guided Generation [13.242188189150987]
We present PAP2PAT, a new challenging benchmark of 1.8k patent-paper pairs with document outlines. Our experiments with current open-weight LLMs and outline-guided generation show that they can effectively use information from the paper but struggle with repetitions, likely due to the inherent repetitiveness of patent language.
arXiv Detail & Related papers (2024-10-09T15:52:48Z)
A Survey on Patent Analysis: From NLP to Multimodal AI [14.090575139188422]
This interdisciplinary survey aims to serve as a comprehensive resource for researchers and practitioners who work at the intersection of NLP, Multimodal AI, and patent analysis.
arXiv Detail & Related papers (2024-04-02T20:44:06Z)
Natural Language Processing in Patents: A Survey [0.0]
Patents, encapsulating crucial technical and legal information, present a rich domain for natural language processing (NLP) applications. As NLP technologies evolve, large language models (LLMs) have demonstrated outstanding capabilities in general text processing and generation tasks. This paper aims to equip NLP researchers with the essential knowledge to navigate this complex domain efficiently.
arXiv Detail & Related papers (2024-03-06T23:17:16Z)
Differentially Private Synthetic Data via Foundation Model APIs 2: Text [56.13240830670327]
A lot of high-quality text data generated in the real world is private and cannot be shared or used freely due to privacy concerns. We propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines.
arXiv Detail & Related papers (2024-03-04T05:57:50Z)
PaECTER: Patent-level Representation Learning using Citation-informed Transformers [0.16785092703248325]
PaECTER is a publicly available, open-source document-level encoder specific for patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents. PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain.
arXiv Detail & Related papers (2024-02-29T18:09:03Z)
Leveraging Large Language Models to Improve REST API Testing [51.284096009803406]
RESTGPT takes as input an API specification, extracts machine-interpretable rules, and generates example parameter values from natural-language descriptions in the specification. Our evaluations indicate that RESTGPT outperforms existing techniques in both rule extraction and value generation.
arXiv Detail & Related papers (2023-12-01T19:53:23Z)
Unveiling Black-boxes: Explainable Deep Learning Models for Patent Classification [48.5140223214582]
State-of-the-art methods for multi-label patent classification rely on deep opaque neural networks (DNNs) We propose a novel deep explainable patent classification framework by introducing layer-wise relevance propagation (LRP) Considering the relevance score, we then generate explanations by visualizing relevant words for the predicted patent class.
arXiv Detail & Related papers (2023-10-31T14:11:37Z)
Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution [48.86322922826514]
This paper defines a new task of Knowledge-aware Language Model Attribution (KaLMA) First, we extend attribution source from unstructured texts to Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios. Second, we propose a new Conscious Incompetence" setting considering the incomplete knowledge repository. Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment.
arXiv Detail & Related papers (2023-10-09T11:45:59Z)
Adaptive Taxonomy Learning and Historical Patterns Modelling for Patent Classification [26.85734804493925]
We propose an integrated framework that comprehensively considers the information on patents for patent classification. We first present an IPC codes correlations learning module to derive their semantic representations. Finally, we combine the contextual information of patent texts that contains the semantics of IPC codes, and assignees' sequential preferences to make predictions.
arXiv Detail & Related papers (2023-08-10T07:02:24Z)
Patent Sentiment Analysis to Highlight Patent Paragraphs [0.0]
Given a patent document, identifying distinct semantic annotations is an interesting research aspect. In the process of manual patent analysis, to attain better readability, recognising the semantic information by marking paragraphs is in practice. This work assist patent practitioners in highlighting semantic information automatically and aid to create a sustainable and efficient patent analysis using the aptitude of Machine Learning.
arXiv Detail & Related papers (2021-11-06T13:28:29Z)
Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings. We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data. We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
Determinantal Point Processes in Randomized Numerical Linear Algebra [80.27102478796613]
Numerical Linear Algebra (RandNLA) uses randomness to develop improved algorithms for matrix problems that arise in scientific computing, data science, machine learning, etc. Recent work has uncovered deep and fruitful connections between DPPs and RandNLA which lead to new guarantees and improved algorithms.
arXiv Detail & Related papers (2020-05-07T00:39:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.