Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data
- URL: http://arxiv.org/abs/2409.09741v1
- Date: Sun, 15 Sep 2024 14:11:24 GMT
- Title: Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data
- Authors: Bastián González-Bustamante
- Abstract summary: This article benchmarked the ability of OpenAI's GPTs and a number of open-source LLMs to perform annotation tasks on political content.
We used a novel protest event dataset comprising more than three million digital interactions.
We created a gold standard that includes ground-truth labels annotated by human coders about toxicity and incivility on social media.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This article benchmarked the ability of OpenAI's GPTs and a number of open-source LLMs to perform annotation tasks on political content. We used a novel protest event dataset comprising more than three million digital interactions and created a gold standard with ground-truth labels for toxicity and incivility on social media, annotated by human coders. We included Google's Perspective algorithm in our benchmark; Perspective and the GPTs were accessed through their respective APIs, while the open-source LLMs were deployed locally. The findings show that Perspective API with a laxer threshold, GPT-4o, and Nous Hermes 2 Mixtral outperform the other LLMs' zero-shot classification annotations. In addition, Nous Hermes 2 and Mistral OpenOrca, with a smaller number of parameters, perform the task to a high standard, making them attractive options that could offer good trade-offs between performance, implementation cost and computing time. Ancillary experiments at different temperature settings show that although the GPTs offer not only excellent computing time but also overall good levels of reliability, only the open-source LLMs ensure fully reproducible annotations.
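To make the setup concrete, below is a minimal sketch of the kind of zero-shot annotation call benchmarked here, using the openai Python client; the model name, prompt wording and binary label set are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch of a zero-shot toxicity annotation call of the kind
# benchmarked in the paper. Model name, prompt wording and label set
# are illustrative assumptions, not the authors' protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Classify the following social media comment as TOXIC or NON-TOXIC. "
    "Answer with a single word.\n\nComment: {text}"
)

def annotate_toxicity(text: str, model: str = "gpt-4o",
                      temperature: float = 0.0) -> str:
    """Return a zero-shot toxicity label for one comment."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,  # 0.0 minimizes sampling variation
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content.strip().upper()
```

Perspective, by contrast, returns a toxicity probability rather than a label, so the "laxer threshold" amounts to binarizing that score at a lower cut-off (say 0.55 rather than 0.70; values here are purely illustrative). Even at temperature 0, hosted models can change behind the API, which is why the abstract credits full reproducibility only to the locally deployed open-source LLMs.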
Related papers
- ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction [33.03433653251314]
We propose ELF-Gym, a framework for evaluating features generated by Large Language Models (LLMs) for tabular prediction.
We curated a new dataset from historical Kaggle competitions, including 251 "golden" features used by top-performing teams.
We empirically demonstrate that, in the best-case scenario, LLMs can semantically capture approximately 56% of the golden features, but at the more demanding implementation level this overlap drops to 13%.
arXiv Detail & Related papers (2024-10-13T13:59:33Z)
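As a toy illustration of the "semantic capture" metric above: a golden feature counts as captured if any LLM-proposed feature is similar enough to it. ELF-Gym's actual matching is more sophisticated; the token-overlap similarity, the 0.5 threshold and the example features below are simplified stand-ins.

```python
# Toy "semantic capture" scorer: a golden feature counts as captured if any
# proposed feature is similar enough. Token overlap is a simplified stand-in
# for the semantic matching a real evaluation would use.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def capture_rate(golden: list[str], proposed: list[str],
                 threshold: float = 0.5) -> float:
    hits = sum(any(jaccard(g, p) >= threshold for p in proposed) for g in golden)
    return hits / len(golden)

golden = ["average basket value per user", "days since last purchase"]
proposed = ["average basket value per user", "user account age in days"]
print(capture_rate(golden, proposed))  # 0.5 -> one of two golden features matched
```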
- LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions.
To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline.
Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
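A schematic of the Decompose, Critique and Refine loop, assuming a generic prompt-to-completion function `llm` supplied by the reader; the prompts and the YES/NO critique format are illustrative, not the paper's exact templates.

```python
# Schematic Decompose-Critique-Refine loop in the spirit of DeCRIM.
# `llm` is any prompt -> completion function (e.g., a chat API wrapper).
from typing import Callable

def decrim(llm: Callable[[str], str], instruction: str, max_rounds: int = 3) -> str:
    # Decompose: split the instruction into atomic constraints.
    constraints = [c for c in llm(
        "List each individual constraint in this instruction, one per line:\n"
        + instruction).splitlines() if c.strip()]
    answer = llm(instruction)
    for _ in range(max_rounds):
        # Critique: check every constraint against the current answer.
        failed = [c for c in constraints if llm(
            f"Does this response satisfy the constraint '{c}'? Answer YES or NO.\n\n"
            + answer).strip().upper().startswith("NO")]
        if not failed:
            break  # all constraints satisfied
        # Refine: rewrite the answer targeting only the failed constraints.
        answer = llm("Revise the response so it satisfies these constraints:\n- "
                     + "\n- ".join(failed)
                     + f"\n\nInstruction: {instruction}\nResponse: {answer}")
    return answer
```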
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
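The self-synthetic idea above can be sketched in a few lines: the student model first invents task inputs, then labels them, and the resulting pairs are used to finetune that same model. `generate` stands in for the student LLM; the prompts and the crude non-empty filter are assumptions, as the paper applies stronger quality filters.

```python
# Sketch of self-synthetic data generation in the spirit of SELF-GUIDE:
# the student model invents task inputs, then labels them, and the pairs
# are used to finetune that same model.
from typing import Callable

def synthesize_pairs(generate: Callable[[str], str], task: str,
                     n: int = 100) -> list[tuple[str, str]]:
    pairs = []
    for _ in range(n):
        x = generate(f"Write one new input example for this task: {task}")
        y = generate(f"Task: {task}\nInput: {x}\nOutput:")
        if x.strip() and y.strip():  # crude filter; the paper uses stronger ones
            pairs.append((x, y))
    return pairs  # finetune the student on these self-generated pairs
```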
- FineSurE: Fine-grained Summarization Evaluation using LLMs [22.62504593575933]
FineSurE is a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs).
It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment.
arXiv Detail & Related papers (2024-07-01T02:20:28Z)
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability [70.84333325049123]
FoFo is a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
arXiv Detail & Related papers (2024-02-28T19:23:27Z)
- Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation [0.0]
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text (D2T) generation.
We find that open LLMs can generate fluent and coherent texts in zero-shot settings from data in common formats collected with Quintd.
arXiv Detail & Related papers (2024-01-18T18:15:46Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
arXiv Detail & Related papers (2023-11-13T15:08:59Z)
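One classic baseline among the UE methods such a framework covers is length-normalized sequence likelihood; the snippet below is a generic illustration of that score, not LM-Polygraph's actual interface.

```python
# A classic sequence-level UE baseline: length-normalized log-likelihood
# turned into a confidence score. Generic illustration only, not
# LM-Polygraph's actual interface.
import math

def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability; closer to 1.0 means more confident."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

print(mean_logprob_confidence([-0.1, -0.3, -0.05]))  # ~0.86
```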
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means of allocating work, achieving up to a 21% performance improvement over a random baseline across different datasets.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
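The allocation idea above can be sketched with per-item label entropy over repeated LLM annotations: low-entropy items stay with the LLM, high-entropy items go to human coders. The sampling scheme and the 0.5-bit cut-off below are illustrative assumptions rather than the paper's calibrated procedure.

```python
# Sketch of uncertainty-guided work allocation in the spirit of CoAnnotating.
from collections import Counter
from math import log2

def label_entropy(labels: list[str]) -> float:
    """Shannon entropy (bits) of repeated LLM labels for one item."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def allocate(item_labels: dict[str, list[str]], cutoff: float = 0.5):
    to_llm, to_human = [], []
    for item, labels in item_labels.items():
        (to_human if label_entropy(labels) > cutoff else to_llm).append(item)
    return to_llm, to_human

llm_votes = {
    "post-1": ["TOXIC"] * 5,  # unanimous -> low entropy
    "post-2": ["TOXIC", "NON-TOXIC", "TOXIC", "NON-TOXIC", "TOXIC"],
}
print(allocate(llm_votes))  # post-1 stays with the LLM; post-2 goes to humans
```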
- Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning [5.822010906632045]
This paper studies the performance of open-source Large Language Models (LLMs) in text classification tasks typical for political science research.
By examining tasks like stance, topic, and relevance classification, we aim to guide scholars in making informed decisions about their use of LLMs for text analysis.
arXiv Detail & Related papers (2023-07-05T10:15:07Z)
- Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method [35.181659789684545]
Automatic summarization generates concise summaries that contain key ideas of source documents.
References from CNN/DailyMail and BBC XSum are noisy, mainly in terms of factual hallucination and information redundancy.
We propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step.
Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L.
arXiv Detail & Related papers (2023-05-22T18:54:35Z)
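A SumCoT-style elicitation can be sketched as a two-step template that first asks for the key elements and then for a summary grounded in them; the element list and wording below are illustrative, not the paper's exact template.

```python
# Sketch of a SumCoT-style step-by-step summarization prompt.
# Element list and wording are illustrative assumptions.
SUMCOT_PROMPT = """Read the article below, then answer step by step.
Step 1: List the core elements: entity, date, event, result, reason.
Step 2: Write a one-sentence summary that covers the elements from Step 1.

Article:
{article}
"""

def build_sumcot_prompt(article: str) -> str:
    return SUMCOT_PROMPT.format(article=article)
```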
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.