Large-Scale Text Analysis Using Generative Language Models: A Case Study
in Discovering Public Value Expressions in AI Patents
- URL: http://arxiv.org/abs/2305.10383v2
- Date: Thu, 18 May 2023 12:34:47 GMT
- Title: Large-Scale Text Analysis Using Generative Language Models: A Case Study
in Discovering Public Value Expressions in AI Patents
- Authors: Sergio Pelaez, Gaurav Verma, Barbara Ribeiro, Philip Shapira
- Abstract summary: This paper employs a novel approach using a generative language model (GPT-4) to produce labels and rationales for large-scale text analysis.
We collect a database comprising 154,934 patent documents using an advanced Boolean query submitted to InnovationQ+.
We design a framework for identifying and labeling public value expressions in these AI patent sentences.
- Score: 2.246222223318928
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Labeling data is essential for training text classifiers but is often
difficult to accomplish accurately, especially for complex and abstract
concepts. Seeking an improved method, this paper employs a novel approach using
a generative language model (GPT-4) to produce labels and rationales for
large-scale text analysis. We apply this approach to the task of discovering
public value expressions in US AI patents. We collect a database comprising
154,934 patent documents using an advanced Boolean query submitted to
InnovationQ+. The results are merged with full patent text from the USPTO,
resulting in 5.4 million sentences. We design a framework for identifying and
labeling public value expressions in these AI patent sentences. A prompt for
GPT-4 is developed which includes definitions, guidelines, examples, and
rationales for text classification. We evaluate the quality of the labels and
rationales produced by GPT-4 using BLEU scores and topic modeling and find that
they are accurate, diverse, and faithful. These rationales also serve as a
chain-of-thought for the model, a transparent mechanism for human verification,
and support for human annotators to overcome cognitive limitations. We conclude
that GPT-4 achieved a high level of recognition of public value theory from our
framework, which it also uses to discover unseen public value expressions. We
use the labels produced by GPT-4 to train BERT-based classifiers and predict
sentences on the entire database, achieving high F1 scores for the 3-class
(0.85) and 2-class classification (0.91) tasks. We discuss the implications of
our approach for conducting large-scale text analyses with complex and abstract
concepts and suggest that, with careful framework design and interactive human
oversight, generative language models can offer significant advantages in
quality and in reduced time and costs for producing labels and rationales.
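The F1 evaluation described above can be sketched with a minimal macro-F1 computation in plain Python. The sentences are not shown here; the label sequences below are invented stand-ins for the paper's GPT-4-labeled patent sentences (using a hypothetical coding of 0 = no public value, 1 = public value expression, 2 = uncertain), not the authors' actual data or code, and the paper itself fine-tunes BERT-based classifiers rather than using this toy metric code.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Invented gold labels vs. classifier predictions for eight sentences.
y_true = [0, 1, 2, 1, 0, 2, 1, 0]
y_pred = [0, 1, 2, 0, 0, 2, 1, 1]
print(round(macro_f1(y_true, y_pred), 3))  # prints 0.778
```

Macro averaging (rather than micro) gives each class equal weight, which matters when the "public value" class is rare relative to ordinary technical sentences in the 5.4-million-sentence corpus.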
Related papers
- De-jargonizing Science for Journalists with GPT-4: A Pilot Study [3.730699089967391] (2024-10-15)
  The system achieves fairly high recall in identifying jargon and preserves relative differences in readers' jargon identification.
  The findings highlight the potential of generative AI for assisting science reporters and can inform future work on developing tools to simplify dense documents.
- GPT Assisted Annotation of Rhetorical and Linguistic Features for Interpretable Propaganda Technique Detection in News Text [1.2699007098398802] (2024-07-16)
  This study codifies 22 rhetorical and linguistic features identified in the literature on the language of persuasion.
  RhetAnn, a web application, was specifically designed to minimize an otherwise considerable mental effort.
  A small set of annotated data was used to fine-tune GPT-3.5, a generative large language model (LLM), to annotate the remaining data.
- ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275] (2023-10-19)
  E-commerce platforms require structured product data in the form of attribute-value pairs.
  BERT-based extraction methods require large amounts of task-specific training data.
  This paper explores using large language models (LLMs) as a more training-data-efficient and robust alternative.
- Adaptive Taxonomy Learning and Historical Patterns Modelling for Patent Classification [26.85734804493925] (2023-08-10)
  We propose an integrated framework that comprehensively considers the information on patents for patent classification.
  We first present an IPC codes correlations learning module to derive their semantic representations.
  Finally, we combine the contextual information of patent texts, which contains the semantics of IPC codes, with assignees' sequential preferences to make predictions.
- Mao-Zedong At SemEval-2023 Task 4: Label Representation Multi-Head Attention Model With Contrastive Learning-Enhanced Nearest Neighbor Mechanism For Multi-Label Text Classification [0.0] (2023-07-11)
  SemEval-2023 Task 4 provides a set of arguments and 20 types of human values implicitly expressed in each argument.
  We propose a multi-head attention mechanism to establish connections between specific labels and semantic components.
  Our approach achieved an F1 score of 0.533 on the test set and ranked fourth on the leaderboard.
- Description-Enhanced Label Embedding Contrastive Learning for Text Classification [65.01077813330559] (2023-06-15)
  The paper incorporates Self-Supervised Learning (SSL) into the model learning process and designs a novel self-supervised Relation of Relation (R2) classification task.
  It proposes a Relation of Relation Learning Network (R2-Net) for text classification, in which text classification and R2 classification are treated as optimization targets.
  It also exploits external knowledge from WordNet to obtain multi-aspect descriptions for label semantic learning.
- Text Classification via Large Language Models [63.1874290788797] (2023-05-15)
  We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
  Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
  More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
- On the Possibilities of AI-Generated Text Detection [76.55825911221434] (2023-04-10)
  We argue that as machine-generated text approximates human-like quality, the sample size needed for detection bounds increases.
  We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
- Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values [73.82043713141142] (2022-10-14)
  Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
  We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446] (2021-03-21)
  We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
  It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
  TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
- Introduction of a novel word embedding approach based on technology labels extracted from patent data [0.0] (2021-01-31)
  This paper introduces a word embedding approach using statistical analysis of human-labeled data to produce accurate and language-independent word vectors for technical terms.
  The resulting algorithm is a development of the former EQMania UG (eqmania.com) and can be tested under eqalice.com until April 2021.