Prior Art Search and Reranking for Generated Patent Text
- URL: http://arxiv.org/abs/2009.09132v2
- Date: Sun, 18 Jul 2021 06:07:21 GMT
- Title: Prior Art Search and Reranking for Generated Patent Text
- Authors: Jieh-Sheng Lee and Jieh Hsiang
- Abstract summary: We implement a reranking system to identify retrospectively the most similar inputs to a GPT model based on its output; to our knowledge, this work is the first to do so.
- Score: 1.8275108630751844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models, such as GPT-2, have demonstrated impressive results
recently. A fundamental question we'd like to address is: where did the
generated text come from? This work is our initial effort toward answering the
question by using prior art search. The purpose of the prior art search is to
find the most similar prior text in the training data of GPT-2. We take a
reranking approach and apply it to the patent domain. Specifically, we
pre-train GPT-2 models from scratch by using the patent data from the USPTO.
The input for the prior art search is the patent text generated by the GPT-2
model. We also pre-train BERT models from scratch for converting patent text
to embeddings. The steps of reranking are: (1) search the most similar text in
the training data of GPT-2 by taking a bag-of-word ranking approach (BM25), (2)
convert the search results in text format to BERT embeddings, and (3) provide
the final result by ranking the BERT embeddings based on their similarities
with the patent text generated by GPT-2. The experiments in this work show that
such reranking is better than ranking with embeddings alone. However, our mixed
results also indicate that calculating the semantic similarities among long
text spans is still challenging. To our knowledge, this work is the first to
implement a reranking system to identify retrospectively the most similar
inputs to a GPT model based on its output.
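The three reranking steps lend themselves to a short illustration. The sketch below pairs a BM25 index (via the rank_bm25 package) with mean-pooled embeddings from a generic bert-base-uncased encoder; the paper pre-trains its GPT-2 and BERT models from scratch on USPTO patent text, so the checkpoint names and pooling choice here are stand-in assumptions rather than the authors' implementation.

```python
# Minimal sketch of the three-step rerank described above, using rank_bm25 and a
# generic Hugging Face BERT in place of the paper's patent-specific models.
import torch
from rank_bm25 import BM25Okapi
from transformers import AutoModel, AutoTokenizer


def embed(texts, tokenizer, model):
    """Mean-pooled BERT embeddings (stand-in for the paper's patent BERT)."""
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state        # (batch, seq_len, hidden)
    mask = enc["attention_mask"].unsqueeze(-1)      # ignore padding when averaging
    return (out * mask).sum(1) / mask.sum(1)


def rerank(generated_text, training_corpus, top_k=100, final_k=10):
    # Step 1: bag-of-words (BM25) search over the GPT-2 training data.
    tokenized = [doc.lower().split() for doc in training_corpus]
    bm25 = BM25Okapi(tokenized)
    candidates = bm25.get_top_n(generated_text.lower().split(), training_corpus, n=top_k)

    # Step 2: convert the BM25 results and the generated text to BERT embeddings.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # stand-in checkpoint
    model = AutoModel.from_pretrained("bert-base-uncased").eval()
    cand_emb = embed(candidates, tokenizer, model)
    query_emb = embed([generated_text], tokenizer, model)

    # Step 3: rerank candidates by cosine similarity to the generated patent text.
    sims = torch.nn.functional.cosine_similarity(query_emb, cand_emb)
    order = sims.argsort(descending=True)[:final_k].tolist()
    return [(candidates[i], float(sims[i])) for i in order]
```

The two-stage design mirrors the paper's finding that reranking beats embeddings alone: BM25 serves as a cheap, recall-oriented lexical filter over the large training corpus, and the embedding similarity then reorders that shortlist semantically.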
Related papers
- PaECTER: Patent-level Representation Learning using Citation-informed Transformers [0.16785092703248325]
PaECTER is a publicly available, open-source document-level encoder specific to patents.
We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents.
PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain.
arXiv Detail & Related papers (2024-02-29T18:09:03Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text [48.36706154871577]
We introduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts).
It diverges from extant corpora by comprising pairs of human-written and ChatGPT-polished abstracts instead of purely ChatGPT-generated texts.
We also propose the "Polish Ratio" method, an innovative measure of the degree of modification made by ChatGPT compared to the original human-written text.
arXiv Detail & Related papers (2023-07-21T06:38:37Z)
- SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning [5.177947445379688]
This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis.
Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification.
The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations.
arXiv Detail & Related papers (2023-07-16T05:33:35Z)
- Collaborative Generative AI: Integrating GPT-k for Efficient Editing in Text-to-Image Generation [114.80518907146792]
We investigate the potential of utilizing large-scale language models, such as GPT-k, to improve the prompt editing process for text-to-image generation.
We compare the common edits made by humans and GPT-k, evaluate the performance of GPT-k in prompting T2I, and examine factors that may influence this process.
arXiv Detail & Related papers (2023-05-18T21:53:58Z)
- Large-Scale Text Analysis Using Generative Language Models: A Case Study in Discovering Public Value Expressions in AI Patents [2.246222223318928]
This paper employs a novel approach using a generative language model (GPT-4) to produce labels and rationales for large-scale text analysis.
We collect a database comprising 154,934 patent documents using an advanced Boolean query submitted to InnovationQ+.
We design a framework for identifying and labeling public value expressions in these AI patent sentences.
arXiv Detail & Related papers (2023-05-17T17:18:26Z)
- Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study [115.96080028033904]
We study a scalable pre-trained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT.
Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models.
arXiv Detail & Related papers (2023-04-13T18:04:19Z)
- Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition [14.82259273703819]
We present results using fine-tuned GPT, GPT-2, and their combination for automatic speech recognition (ASR).
A conversion method is proposed to compute the correct language prior probability based on bidirectional LM outputs.
The proposed conversion for language prior probabilities enables BERT to achieve an extra 3% relative word error rate reduction (WERR).
arXiv Detail & Related papers (2021-07-29T16:53:37Z)
- BARTScore: Evaluating Generated Text as Text Generation [89.50052670307434]
We conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models.
We operationalize this idea using BART, an encoder-decoder based pre-trained model.
We propose a metric, BARTScore, with a number of variants that can be flexibly applied to the evaluation of text from different perspectives; a minimal sketch of the scoring idea follows this list.
arXiv Detail & Related papers (2021-06-22T03:20:53Z)
- BERT based patent novelty search by training claims to their own description [0.0]
We introduce a new scoring scheme, relevance scoring or novelty scoring, to process the output of BERT in a meaningful way.
We tested the method on patent applications by training BERT on the first claims of patents and corresponding descriptions.
BERT's output has been processed according to the relevance score and the results compared with the cited X documents in the search reports.
arXiv Detail & Related papers (2021-03-01T16:54:50Z)
- Investigating African-American Vernacular English in Transformer-Based Text Generation [55.53547556060537]
Social media has encouraged the written use of African American Vernacular English (AAVE).
We investigate the performance of GPT-2 on AAVE text by creating a dataset of intent-equivalent parallel AAVE/SAE tweet pairs.
We find that while AAVE text results in more classifications of negative sentiment than SAE, the use of GPT-2 generally increases occurrences of positive sentiment for both.
arXiv Detail & Related papers (2020-10-06T06:27:02Z)
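As noted in the BARTScore entry above, evaluation is reduced to sequence-to-sequence likelihood. The sketch below illustrates that idea: a hypothesis is scored by the average token log-probability BART assigns when generating it from the source text. The facebook/bart-large-cnn checkpoint is an assumed stand-in, not necessarily the one used in the paper.

```python
# Minimal sketch of the BARTScore idea: score = average log-likelihood of the
# hypothesis under BART, conditioned on the source text.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer


def bart_score(source: str, hypothesis: str,
               model_name: str = "facebook/bart-large-cnn") -> float:
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name).eval()

    src = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)
    tgt = tokenizer(hypothesis, return_tensors="pt", truncation=True, max_length=1024)

    with torch.no_grad():
        # Passing labels makes the model compute token-level cross-entropy
        # against the hypothesis.
        out = model(**src, labels=tgt["input_ids"])

    # Negative mean cross-entropy = average log-probability per hypothesis token.
    return -out.loss.item()
```

Higher (less negative) scores mean the model finds the hypothesis more likely given the source; scoring in different directions (e.g., source-to-hypothesis versus hypothesis-to-reference) yields the different evaluation perspectives the entry mentions.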