Large-Scale Text Analysis Using Generative Language Models: A Case Study
in Discovering Public Value Expressions in AI Patents
- URL: http://arxiv.org/abs/2305.10383v2
- Date: Thu, 18 May 2023 12:34:47 GMT
- Title: Large-Scale Text Analysis Using Generative Language Models: A Case Study
in Discovering Public Value Expressions in AI Patents
- Authors: Sergio Pelaez, Gaurav Verma, Barbara Ribeiro, Philip Shapira
- Abstract summary: This paper employs a novel approach using a generative language model (GPT-4) to produce labels and rationales for large-scale text analysis.
We collect a database comprising 154,934 patent documents using an advanced Boolean query submitted to InnovationQ+.
We design a framework for identifying and labeling public value expressions in these AI patent sentences.
- Score: 2.246222223318928
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Labeling data is essential for training text classifiers but is often
difficult to accomplish accurately, especially for complex and abstract
concepts. Seeking an improved method, this paper employs a novel approach using
a generative language model (GPT-4) to produce labels and rationales for
large-scale text analysis. We apply this approach to the task of discovering
public value expressions in US AI patents. We collect a database comprising
154,934 patent documents using an advanced Boolean query submitted to
InnovationQ+. The results are merged with full patent text from the USPTO,
resulting in 5.4 million sentences. We design a framework for identifying and
labeling public value expressions in these AI patent sentences. A prompt for
GPT-4 is developed which includes definitions, guidelines, examples, and
rationales for text classification. We evaluate the quality of the labels and
rationales produced by GPT-4 using BLEU scores and topic modeling and find that
they are accurate, diverse, and faithful. These rationales also serve as a
chain-of-thought for the model, a transparent mechanism for human verification,
and support for human annotators to overcome cognitive limitations. We conclude
that GPT-4 achieved a high level of recognition of public value theory from our
framework, which it also uses to discover unseen public value expressions. We
use the labels produced by GPT-4 to train BERT-based classifiers and predict
sentences on the entire database, achieving high F1 scores for the 3-class
(0.85) and 2-class classification (0.91) tasks. We discuss the implications of
our approach for conducting large-scale text analyses with complex and abstract
concepts and suggest that, with careful framework design and interactive human
oversight, generative language models can offer significant advantages in
quality and in reduced time and costs for producing labels and rationales.
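The F1 evaluation described above can be sketched with a minimal macro-F1 computation in plain Python. The sentences are not shown here; the label sequences below are invented stand-ins for the paper's GPT-4-labeled patent sentences (using a hypothetical coding of 0 = no public value, 1 = public value expression, 2 = uncertain), not the authors' actual data or code, and the paper itself fine-tunes BERT-based classifiers rather than using this toy metric code.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Invented gold labels vs. classifier predictions for eight sentences.
y_true = [0, 1, 2, 1, 0, 2, 1, 0]
y_pred = [0, 1, 2, 0, 0, 2, 1, 1]
print(round(macro_f1(y_true, y_pred), 3))  # prints 0.778
```

Macro averaging (rather than micro) gives each class equal weight, which matters when the "public value" class is rare relative to ordinary technical sentences in the 5.4-million-sentence corpus.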
Related papers
- De-jargonizing Science for Journalists with GPT-4: A Pilot Study [3.730699089967391] (2024-10-15)
  The system achieves fairly high recall in identifying jargon and preserves relative differences in readers' jargon identification.
  The findings highlight the potential of generative AI for assisting science reporters and can inform future work on developing tools to simplify dense documents.
- GPT Assisted Annotation of Rhetorical and Linguistic Features for Interpretable Propaganda Technique Detection in News Text [1.2699007098398802] (2024-07-16)
  This study codifies 22 rhetorical and linguistic features identified in the literature on the language of persuasion.
  RhetAnn, a web application, was specifically designed to minimize an otherwise considerable mental effort.
  A small set of annotated data was used to fine-tune GPT-3.5, a generative large language model (LLM), to annotate the remaining data.
- ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275] (2023-10-19)
  E-commerce platforms require structured product data in the form of attribute-value pairs.
  BERT-based extraction methods require large amounts of task-specific training data.
  This paper explores using large language models (LLMs) as a more training-data-efficient and robust alternative.
- Adaptive Taxonomy Learning and Historical Patterns Modelling for Patent Classification [26.85734804493925] (2023-08-10)
  We propose an integrated framework that comprehensively considers the information on patents for patent classification.
  We first present an IPC codes correlations learning module to derive their semantic representations.
  Finally, we combine the contextual information of patent texts, which contains the semantics of IPC codes, with assignees' sequential preferences to make predictions.
- Mao-Zedong At SemEval-2023 Task 4: Label Representation Multi-Head Attention Model With Contrastive Learning-Enhanced Nearest Neighbor Mechanism For Multi-Label Text Classification [0.0] (2023-07-11)
  SemEval-2023 Task 4 provides a set of arguments and 20 types of human values implicitly expressed in each argument.
  We propose a multi-head attention mechanism to establish connections between specific labels and semantic components.
  Our approach achieved an F1 score of 0.533 on the test set and ranked fourth on the leaderboard.
- Description-Enhanced Label Embedding Contrastive Learning for Text Classification [65.01077813330559] (2023-06-15)
  The paper incorporates Self-Supervised Learning (SSL) into the model learning process and designs a novel self-supervised Relation of Relation (R2) classification task.
  It proposes a Relation of Relation Learning Network (R2-Net) for text classification, in which text classification and R2 classification are treated as optimization targets.
  It also exploits external knowledge from WordNet to obtain multi-aspect descriptions for label semantic learning.
- Text Classification via Large Language Models [63.1874290788797] (2023-05-15)
  We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
  Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
  More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
- On the Possibilities of AI-Generated Text Detection [76.55825911221434] (2023-04-10)
  We argue that as machine-generated text approximates human-like quality, the sample size needed for detection bounds increases.
  We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
- Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values [73.82043713141142] (2022-10-14)
  Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
  We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446] (2021-03-21)
  We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
  It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
  TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
- Introduction of a novel word embedding approach based on technology labels extracted from patent data [0.0] (2021-01-31)
  This paper introduces a word embedding approach using statistical analysis of human-labeled data to produce accurate and language-independent word vectors for technical terms.
  The resulting algorithm is a development of the former EQMania UG (eqmania.com) and can be tested under eqalice.com until April 2021.