Multiclass Classification of Policy Documents with Large Language Models
- URL: http://arxiv.org/abs/2310.08167v1
- Date: Thu, 12 Oct 2023 09:41:22 GMT
- Title: Multiclass Classification of Policy Documents with Large Language Models
- Authors: Erkan Gunes, Christoffer Koch Florczak
- Abstract summary: We use OpenAI's GPT-3.5 and GPT-4 models to classify congressional bills and congressional hearings into the Comparative Agendas Project's 21 major policy issue topics.
We propose three use-case scenarios and estimate overall accuracies ranging from 58% to 83%, depending on the scenario and GPT model employed.
Overall, our results point to the insufficiency of relying entirely on GPT with minimal human intervention, to accuracy that increases with the human effort exerted, and to a surprisingly high accuracy achieved in the most demanding use case.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Classifying policy documents into policy issue topics has been a
long-standing effort in political science and communication disciplines.
Efforts to automate text classification for social science research have so
far achieved remarkable results, but there is still considerable room for
progress. In this work, we test the prediction performance of an alternative
strategy that requires far less human involvement than full manual coding. We
use OpenAI's GPT-3.5 and GPT-4 models, which are pre-trained, instruction-tuned
Large Language Models (LLMs), to classify congressional bills and congressional
hearings into the Comparative Agendas Project's 21 major policy issue topics.
We propose three use-case scenarios and estimate overall accuracies ranging
from 58% to 83%, depending on the scenario and GPT model employed. The three
scenarios aim at minimal, moderate, and major human interference, respectively.
Overall, our results point to the insufficiency of relying entirely on GPT with
minimal human intervention, to accuracy that increases with the human effort
exerted, and to a surprisingly high accuracy achieved in the most demanding use
case. Notably, the best-performing use case achieved 83% accuracy on the 65% of
the data on which the two models agreed, suggesting that an approach similar to
ours can be implemented relatively easily and allow mostly automated coding of
a majority of a given dataset. This could free up resources, allowing manual
human coding of the remaining 35% of the data to achieve a higher overall level
of accuracy while reducing costs significantly.
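The best-performing scenario described above keeps only the documents on which the two GPT models assign the same topic and routes the rest to human coders. A minimal sketch of that agreement filter is given below, assuming each model has already been wrapped as a text-to-label callable (the wrapper functions, the topic subset, and the names here are illustrative assumptions, not the authors' exact setup):

```python
from typing import Callable, List, Tuple

# Illustrative subset of the Comparative Agendas Project's 21 major topics
# (not the full codebook).
CAP_TOPICS = ["Macroeconomics", "Health", "Defense", "Education", "Environment"]

def classify_with_agreement(
    texts: List[str],
    model_a: Callable[[str], str],  # e.g. a wrapper around a GPT-3.5 call
    model_b: Callable[[str], str],  # e.g. a wrapper around a GPT-4 call
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Label each text with two models; keep labels only where they agree.

    Returns (agreed, flagged): `agreed` holds (text, label) pairs coded
    automatically, `flagged` holds the texts routed to manual human coding.
    """
    agreed: List[Tuple[str, str]] = []
    flagged: List[str] = []
    for text in texts:
        label_a, label_b = model_a(text), model_b(text)
        if label_a == label_b:
            agreed.append((text, label_a))  # both models concur: accept label
        else:
            flagged.append(text)  # disagreement: defer to a human coder
    return agreed, flagged
```

In the paper's setting, roughly 65% of documents would land in `agreed` (coded at about 83% accuracy) and the remaining 35% in `flagged` for manual coding.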
Related papers
- GPT-5 vs Other LLMs in Long Short-Context Performance [2.640490999540592]
This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long short-context tasks.
As the input volume on the social media dataset exceeds 5K posts (70K tokens), the performance of all models degrades significantly.
In the GPT-5 model, despite the sharp decline in accuracy, its precision remained high at approximately 95%.
arXiv Detail & Related papers (2026-02-15T15:26:25Z) - Prompt-Based Clarity Evaluation and Topic Detection in Political Question Answering [0.0]
We study prompt-based clarity evaluation using the CLARITY dataset from the SemEval 2026 shared task.
We compare a GPT-3.5 baseline provided with the dataset against GPT-5.2 evaluated under three prompting strategies.
Results show that GPT-5.2 consistently outperforms the GPT-3.5 baseline on clarity prediction.
arXiv Detail & Related papers (2026-01-13T03:10:58Z) - Large Language Models Reasoning Abilities Under Non-Ideal Conditions After RL-Fine-Tuning [33.27410930782468]
We introduce a new research direction guided by brain-science findings that human reasoning remains reliable under imperfect inputs.
We fine-tune three large language models (LLMs) and a state-of-the-art large vision-language model (LVLM) using reinforcement learning (RL) with a representative policy-gradient algorithm.
Our results reveal that while RL fine-tuning improves baseline reasoning under idealized settings, performance declines significantly across all three non-ideal scenarios.
arXiv Detail & Related papers (2025-08-06T19:51:29Z) - EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning.
Our approach only leverages a small number of samples to search for the desired pruning policy.
We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z) - Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [51.41246396610475]
This paper aims to predict performance in closed-book question answering (QA) without the help of external tools.
We conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models.
Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics.
arXiv Detail & Related papers (2025-02-06T13:23:53Z) - LLM4DS: Evaluating Large Language Models for Data Science Code Generation [0.0]
This paper empirically assesses the performance of four leading AI assistants: Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct).
All models exceeded a 50% success rate, confirming their capability beyond random chance.
ChatGPT demonstrated consistent performance across varying difficulty levels, while Claude's success rate fluctuated with task complexity.
arXiv Detail & Related papers (2024-11-16T18:43:26Z) - Efficacy of Large Language Models in Systematic Reviews [0.0]
This study investigates the effectiveness of Large Language Models (LLMs) in interpreting existing literature.
We compiled and hand-coded a database of 88 relevant papers published from March 2020 to May 2024.
We evaluated two current state-of-the-art LLMs, Meta AI's Llama 3 8B and OpenAI's GPT-4o, on the accuracy of their interpretations.
arXiv Detail & Related papers (2024-08-03T00:01:13Z) - Uncovering Weaknesses in Neural Code Generation [21.552898575210534]
We assess the quality of generated code using match-based and execution-based metrics, then conduct thematic analysis to develop a taxonomy of nine types of weaknesses.
In the CoNaLa dataset, inaccurate prompts are a notable problem, causing all large models to fail in 26.84% of cases.
Missing pivotal semantics is a pervasive issue across benchmarks, with one or more large models omitting key semantics in 65.78% of CoNaLa tasks.
All models struggle with proper API usage, a challenge amplified by vague or complex prompts.
arXiv Detail & Related papers (2024-07-13T07:31:43Z) - MetaKP: On-Demand Keyphrase Generation [52.48698290354449]
We introduce on-demand keyphrase generation, a novel paradigm that requires keyphrases that conform to specific high-level goals or intents.
We present MetaKP, a large-scale benchmark comprising four datasets, 7500 documents, and 3760 goals across news and biomedical domains with human-annotated keyphrases.
We demonstrate the potential of our method to serve as a general NLP infrastructure, exemplified by its application in epidemic event detection from social media.
arXiv Detail & Related papers (2024-06-28T19:02:59Z) - Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z) - Breaking the Bank with ChatGPT: Few-Shot Text Classification for Finance [4.305568120980929]
In-context learning with GPT-3.5 and GPT-4 minimizes the technical expertise required and eliminates the need for expensive GPU computing.
We fine-tune other pre-trained, masked language models with SetFit to achieve state-of-the-art results both in full-data and few-shot settings.
Our findings show that querying GPT-3.5 and GPT-4 can outperform fine-tuned, non-generative models even with fewer examples.
arXiv Detail & Related papers (2023-08-28T15:04:16Z) - Assessing the Effectiveness of GPT-3 in Detecting False Political
Statements: A Case Study on the LIAR Dataset [0.0]
The detection of political fake statements is crucial for maintaining information integrity and preventing the spread of misinformation in society.
Historically, state-of-the-art machine learning models employed various methods for detecting deceptive statements.
Recent advancements in large language models, such as GPT-3, have achieved state-of-the-art performance on a wide range of tasks.
arXiv Detail & Related papers (2023-06-14T01:16:49Z) - LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples.
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z) - Large Language Models in the Workplace: A Case Study on Prompt
Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z) - Using Sampling to Estimate and Improve Performance of Automated Scoring
Systems with Guarantees [63.62448343531963]
We propose a combination of the existing paradigms, sampling responses to be scored by humans intelligently.
We observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget.
arXiv Detail & Related papers (2021-11-17T05:00:51Z) - Fast Uncertainty Quantification for Deep Object Pose Estimation [91.09217713805337]
Deep learning-based object pose estimators are often unreliable and overconfident.
In this work, we propose a simple, efficient, and plug-and-play UQ method for 6-DoF object pose estimation.
arXiv Detail & Related papers (2020-11-16T06:51:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.