Large language models can consistently generate high-quality content for election disinformation operations
- URL: http://arxiv.org/abs/2408.06731v1
- Date: Tue, 13 Aug 2024 08:45:34 GMT
- Title: Large language models can consistently generate high-quality content for election disinformation operations
- Authors: Angus R. Williams, Liam Burke-Moore, Ryan Sze-Yin Chan, Florence E. Enock, Federico Nanni, Tvesha Sippy, Yi-Ling Chung, Evelina Gabasova, Kobi Hackenburg, Jonathan Bright
- Abstract summary: Large language models have raised concerns about their potential use in generating compelling election disinformation at scale.
This study presents a two-part investigation into the capabilities of LLMs to automate stages of an election disinformation operation.
- Score: 2.98293101034582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in large language models have raised concerns about their potential use in generating compelling election disinformation at scale. This study presents a two-part investigation into the capabilities of LLMs to automate stages of an election disinformation operation. First, we introduce DisElect, a novel evaluation dataset designed to measure LLM compliance with instructions to generate content for an election disinformation operation in a localised UK context, containing 2,200 malicious prompts and 50 benign prompts. Using DisElect, we test 13 LLMs and find that most models broadly comply with these requests; we also find that the few models which refuse malicious prompts also refuse benign election-related prompts, and are more likely to refuse to generate content from a right-wing perspective. Secondly, we conduct a series of experiments (N=2,340) to assess the "humanness" of LLMs: the extent to which disinformation operation content generated by an LLM is able to pass as human-written. Our experiments suggest that almost all of the tested LLMs released since 2022 produce election disinformation operation content indiscernible by human evaluators over 50% of the time. Notably, we observe that multiple models achieve above-human levels of humanness. Taken together, these findings suggest that current LLMs can be used to generate high-quality content for election disinformation operations, even in hyperlocalised scenarios, at far lower costs than traditional methods, and offer researchers and policymakers an empirical benchmark for the measurement and evaluation of these capabilities in current and future models.
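Both headline measurements in the abstract reduce to simple proportions over labelled outputs. The sketch below is a minimal illustration of that arithmetic, not the authors' DisElect pipeline: the generate() callable stands in for an arbitrary LLM call, and the keyword-based refusal check is a deliberately crude, hypothetical stand-in for the paper's compliance annotation.

```python
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "as an ai")  # crude, hypothetical heuristic


def looks_like_refusal(text: str) -> bool:
    """Very rough proxy for a refusal; the paper relies on more careful labelling."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def compliance_rate(prompts: Iterable[str], generate: Callable[[str], str]) -> float:
    """Fraction of prompts for which the model returns non-refusing output."""
    responses = [generate(p) for p in prompts]
    return sum(not looks_like_refusal(r) for r in responses) / len(responses)


def humanness_rate(judged_human: Iterable[bool]) -> float:
    """Fraction of evaluator judgements that marked AI-written text as human.

    Above 0.5 corresponds to the abstract's 'indiscernible over 50% of the
    time'; above the rate achieved by genuine human-written text it would
    count as 'above-human' humanness.
    """
    judged_human = list(judged_human)
    return sum(judged_human) / len(judged_human)


if __name__ == "__main__":
    # Toy stand-ins for a real model call and a real human evaluation.
    def fake_generate(prompt: str) -> str:
        return "Drafted article text..." if "write" in prompt else "I cannot help with that."

    prompts = ["write a local news story", "write a social media post", "summarise turnout figures"]
    print("compliance:", round(compliance_rate(prompts, fake_generate), 2))
    print("humanness:", humanness_rate([True, False, True, True]))
```

In practice the compliance labels would come from annotating model outputs and the humanness judgements from human evaluations like the paper's N=2,340 experiments; the underlying rate calculations, however, stay this simple.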
Related papers
- United in Diversity? Contextual Biases in LLM-Based Predictions of the 2024 European Parliament Elections [45.84205238554709]
Large language models (LLMs) are perceived by some as having the potential to revolutionize social science research.
In this study, we examine to what extent LLM-based predictions of public opinion exhibit context-dependent biases.
We predict voting behavior in the 2024 European Parliament elections using a state-of-the-art LLM.
arXiv Detail & Related papers (2024-08-29T16:01:06Z)
- CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating how well large language models (LLMs) identify and clarify ambiguous user queries.
Building upon a taxonomy of ambiguous queries, we construct 12K high-quality samples to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
- Large Language Models (LLMs) as Agents for Augmented Democracy [6.491009626125319]
We explore an augmented democracy system built on off-the-shelf LLMs fine-tuned to augment data on citizens' preferences.
We use a train-test cross-validation setup to estimate the accuracy with which the LLMs predict both a subject's individual political choices and the aggregate preferences of the full sample of participants.
arXiv Detail & Related papers (2024-05-06T13:23:57Z)
- Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: visual content is unnecessary for many samples, and unintentional data leakage exists.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs by: 1) generalizing to out-of-distribution data, 2) elucidating how LLMs benefit from discriminative models, and 3) minimizing hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
- CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care [14.326936563564171]
We present CARE-MI, a benchmark for evaluating misinformation in large language models (LLMs).
Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models.
Using our benchmark, we conduct extensive experiments and find that current Chinese LLMs are far from perfect on the topic of maternity and infant care.
arXiv Detail & Related papers (2023-07-04T03:34:19Z)
- Large Language Models are Zero-Shot Rankers for Recommender Systems [76.02500186203929]
This work aims to investigate the capacity of large language models (LLMs) to act as the ranking model for recommender systems.
We show that LLMs have promising zero-shot ranking abilities but struggle to perceive the order of historical interactions.
We demonstrate that these issues can be alleviated using specially designed prompting and bootstrapping strategies.
arXiv Detail & Related papers (2023-05-15T17:57:39Z)
- PALR: Personalization Aware LLMs for Recommendation [7.407353565043918]
PALR aims to combine user history behaviors (such as clicks, purchases, and ratings) with large language models (LLMs) to generate user-preferred items.
Our solution outperforms state-of-the-art models on various sequential recommendation tasks.
arXiv Detail & Related papers (2023-05-12T17:21:33Z)