Prior Art Search and Reranking for Generated Patent Text
- URL: http://arxiv.org/abs/2009.09132v2
- Date: Sun, 18 Jul 2021 06:07:21 GMT
- Title: Prior Art Search and Reranking for Generated Patent Text
- Authors: Jieh-Sheng Lee and Jieh Hsiang
- Abstract summary: We implement a reranking system to identify retrospectively the most similar inputs to a GPT model based on its output; to our knowledge, this work is the first to do so.
- Score: 1.8275108630751844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models, such as GPT-2, have demonstrated impressive results
recently. A fundamental question we'd like to address is: where did the
generated text come from? This work is our initial effort toward answering the
question by using prior art search. The purpose of the prior art search is to
find the most similar prior text in the training data of GPT-2. We take a
reranking approach and apply it to the patent domain. Specifically, we
pre-train GPT-2 models from scratch by using the patent data from the USPTO.
The input for the prior art search is the patent text generated by the GPT-2
model. We also pre-train BERT models from scratch for converting patent text
to embeddings. The steps of reranking are: (1) search the most similar text in
the training data of GPT-2 by taking a bag-of-word ranking approach (BM25), (2)
convert the search results in text format to BERT embeddings, and (3) provide
the final result by ranking the BERT embeddings based on their similarities
with the patent text generated by GPT-2. The experiments in this work show that
such reranking is better than ranking with embeddings alone. However, our mixed
results also indicate that calculating the semantic similarities among long
text spans is still challenging. To our knowledge, this work is the first to
implement a reranking system to identify retrospectively the most similar
inputs to a GPT model based on its output.
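The three reranking steps lend themselves to a short illustration. The sketch below pairs a BM25 index (via the rank_bm25 package) with mean-pooled embeddings from a generic bert-base-uncased encoder; the paper pre-trains its GPT-2 and BERT models from scratch on USPTO patent text, so the checkpoint names and pooling choice here are stand-in assumptions rather than the authors' implementation.

```python
# Minimal sketch of the three-step rerank described above, using rank_bm25 and a
# generic Hugging Face BERT in place of the paper's patent-specific models.
import torch
from rank_bm25 import BM25Okapi
from transformers import AutoModel, AutoTokenizer


def embed(texts, tokenizer, model):
    """Mean-pooled BERT embeddings (stand-in for the paper's patent BERT)."""
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state        # (batch, seq_len, hidden)
    mask = enc["attention_mask"].unsqueeze(-1)      # ignore padding when averaging
    return (out * mask).sum(1) / mask.sum(1)


def rerank(generated_text, training_corpus, top_k=100, final_k=10):
    # Step 1: bag-of-words (BM25) search over the GPT-2 training data.
    tokenized = [doc.lower().split() for doc in training_corpus]
    bm25 = BM25Okapi(tokenized)
    candidates = bm25.get_top_n(generated_text.lower().split(), training_corpus, n=top_k)

    # Step 2: convert the BM25 results and the generated text to BERT embeddings.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # stand-in checkpoint
    model = AutoModel.from_pretrained("bert-base-uncased").eval()
    cand_emb = embed(candidates, tokenizer, model)
    query_emb = embed([generated_text], tokenizer, model)

    # Step 3: rerank candidates by cosine similarity to the generated patent text.
    sims = torch.nn.functional.cosine_similarity(query_emb, cand_emb)
    order = sims.argsort(descending=True)[:final_k].tolist()
    return [(candidates[i], float(sims[i])) for i in order]
```

The two-stage design mirrors the paper's finding that reranking beats embeddings alone: BM25 serves as a cheap, recall-oriented lexical filter over the large training corpus, and the embedding similarity then reorders that shortlist semantically.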
Related papers
- PaECTER: Patent-level Representation Learning using Citation-informed Transformers [0.16785092703248325]
PaECTER is a publicly available, open-source document-level encoder specific to patents.
We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents.
PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain.
arXiv Detail & Related papers (2024-02-29T18:09:03Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text [48.36706154871577]
We introduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts).
It diverges from extant corpora by comprising pairs of human-written and ChatGPT-polished abstracts instead of purely ChatGPT-generated texts.
We also propose the "Polish Ratio" method, an innovative measure of the degree of modification made by ChatGPT compared to the original human-written text.
arXiv Detail & Related papers (2023-07-21T06:38:37Z)
- SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning [5.177947445379688]
This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis.
Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification.
The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations.
arXiv Detail & Related papers (2023-07-16T05:33:35Z)
- Collaborative Generative AI: Integrating GPT-k for Efficient Editing in Text-to-Image Generation [114.80518907146792]
We investigate the potential of utilizing large-scale language models, such as GPT-k, to improve the prompt editing process for text-to-image generation.
We compare the common edits made by humans and GPT-k, evaluate the performance of GPT-k in prompting T2I, and examine factors that may influence this process.
arXiv Detail & Related papers (2023-05-18T21:53:58Z)
- Large-Scale Text Analysis Using Generative Language Models: A Case Study in Discovering Public Value Expressions in AI Patents [2.246222223318928]
This paper employs a novel approach using a generative language model (GPT-4) to produce labels and rationales for large-scale text analysis.
We collect a database comprising 154,934 patent documents using an advanced Boolean query submitted to InnovationQ+.
We design a framework for identifying and labeling public value expressions in these AI patent sentences.
arXiv Detail & Related papers (2023-05-17T17:18:26Z)
- Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study [115.96080028033904]
We study a scalable pre-trained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT.
Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models.
arXiv Detail & Related papers (2023-04-13T18:04:19Z)
- Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition [14.82259273703819]
We present results using fine-tuned GPT, GPT-2, and their combination for automatic speech recognition (ASR).
A conversion method is proposed to compute the correct language prior probability based on bidirectional LM outputs.
The proposed conversion for language prior probabilities enables BERT to achieve an extra 3% relative word error rate reduction (WERR).
arXiv Detail & Related papers (2021-07-29T16:53:37Z)
- BARTScore: Evaluating Generated Text as Text Generation [89.50052670307434]
We conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models.
We operationalize this idea using BART, an encoder-decoder based pre-trained model.
We propose a metric, BARTScore, with a number of variants that can be flexibly applied to the evaluation of text from different perspectives; a minimal sketch of the scoring idea follows this list.
arXiv Detail & Related papers (2021-06-22T03:20:53Z)
- BERT based patent novelty search by training claims to their own description [0.0]
We introduce a new scoring scheme, relevance scoring or novelty scoring, to process the output of BERT in a meaningful way.
We tested the method on patent applications by training BERT on the first claims of patents and corresponding descriptions.
BERT's output has been processed according to the relevance score and the results compared with the cited X documents in the search reports.
arXiv Detail & Related papers (2021-03-01T16:54:50Z)
- Investigating African-American Vernacular English in Transformer-Based Text Generation [55.53547556060537]
Social media has encouraged the written use of African American Vernacular English (AAVE).
We investigate the performance of GPT-2 on AAVE text by creating a dataset of intent-equivalent parallel AAVE/SAE tweet pairs.
We find that while AAVE text results in more classifications of negative sentiment than SAE, the use of GPT-2 generally increases occurrences of positive sentiment for both.
arXiv Detail & Related papers (2020-10-06T06:27:02Z)
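As noted in the BARTScore entry above, evaluation is reduced to sequence-to-sequence likelihood. The sketch below illustrates that idea: a hypothesis is scored by the average token log-probability BART assigns when generating it from the source text. The facebook/bart-large-cnn checkpoint is an assumed stand-in, not necessarily the one used in the paper.

```python
# Minimal sketch of the BARTScore idea: score = average log-likelihood of the
# hypothesis under BART, conditioned on the source text.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer


def bart_score(source: str, hypothesis: str,
               model_name: str = "facebook/bart-large-cnn") -> float:
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name).eval()

    src = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)
    tgt = tokenizer(hypothesis, return_tensors="pt", truncation=True, max_length=1024)

    with torch.no_grad():
        # Passing labels makes the model compute token-level cross-entropy
        # against the hypothesis.
        out = model(**src, labels=tgt["input_ids"])

    # Negative mean cross-entropy = average log-probability per hypothesis token.
    return -out.loss.item()
```

Higher (less negative) scores mean the model finds the hypothesis more likely given the source; scoring in different directions (e.g., source-to-hypothesis versus hypothesis-to-reference) yields the different evaluation perspectives the entry mentions.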