ConspEmoLLM-v2: A robust and stable model to detect sentiment-transformed conspiracy theories
- URL: http://arxiv.org/abs/2505.14917v1
- Date: Tue, 20 May 2025 21:12:30 GMT
- Title: ConspEmoLLM-v2: A robust and stable model to detect sentiment-transformed conspiracy theories
- Authors: Zhiwei Liu, Paul Thompson, Jiaqi Rong, Sophia Ananiadou
- Abstract summary: Large language models (LLMs) can cause harm, e.g., through the automatic generation of misinformation, including conspiracy theories. LLMs can also "disguise" conspiracy theories by altering characteristic textual features, e.g., by transforming their typically strong negative emotions into a more positive tone.
- Score: 23.28977129097429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the many benefits of large language models (LLMs), they can also cause harm, e.g., through the automatic generation of misinformation, including conspiracy theories. Moreover, LLMs can "disguise" conspiracy theories by altering characteristic textual features, e.g., by transforming their typically strong negative emotions into a more positive tone. Although several studies have proposed automated conspiracy theory detection methods, they are usually trained on human-authored text, whose features can differ from those of LLM-generated text. Furthermore, several conspiracy detection models, including the previously proposed ConspEmoLLM, rely heavily on the typical emotional features of human-authored conspiracy content. As such, intentionally disguised content may evade detection. To combat these issues, we first developed an augmented version of the ConDID conspiracy detection dataset, ConDID-v2, which supplements human-authored conspiracy tweets with versions rewritten by an LLM to reduce the negativity of their original sentiment. The quality of the rewritten tweets was verified by combining human and LLM-based assessment. We subsequently used ConDID-v2 to train ConspEmoLLM-v2, an enhanced version of ConspEmoLLM. Experimental results demonstrate that ConspEmoLLM-v2 retains or exceeds the performance of ConspEmoLLM on the original human-authored content in ConDID, and considerably outperforms both ConspEmoLLM and several other baselines when applied to the sentiment-transformed tweets in ConDID-v2. The project will be available at https://github.com/lzw108/ConspEmoLLM.
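To make the augmentation step concrete, here is a minimal sketch of LLM-based sentiment transformation, assuming an OpenAI-style chat API; the prompt wording, model name, and helper function are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch (not the authors' exact pipeline): rewrite a
# human-authored conspiracy tweet so that its strongly negative sentiment
# becomes more neutral/positive while keeping the underlying claim intact.
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REWRITE_PROMPT = (
    "Rewrite the following tweet so that it expresses the same claim, "
    "but with a neutral or positive emotional tone. Keep the length similar.\n\n"
    "Tweet: {tweet}\n\nRewritten tweet:"
)

def sentiment_transform(tweet: str, model: str = "gpt-4o-mini") -> str:
    """Return a sentiment-transformed version of a tweet (hypothetical prompt and model)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(tweet=tweet)}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

# Each rewritten tweet would keep the original conspiracy/non-conspiracy label,
# so the augmented dataset pairs original and sentiment-transformed examples.
```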
Related papers
- Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning [12.06050648342985]
Hallucination poses a major obstacle to the reasoning capabilities of large language models. We introduce the Multi-agent Undercover Gaming (MUG) protocol, inspired by social deduction games like "Who is Undercover?". MUG reframes MAD as a process of detecting "undercover" agents (those suffering from hallucinations) by employing multimodal counterfactual tests.
arXiv Detail & Related papers (2025-11-14T11:27:55Z) - Do Androids Dream of Unseen Puppeteers? Probing for a Conspiracy Mindset in Large Language Models [6.909716378472136]
We investigate whether Large Language Models (LLMs) exhibit conspiratorial tendencies, whether they display sociodemographic biases in this domain, and how easily they can be conditioned into adopting conspiratorial perspectives.
arXiv Detail & Related papers (2025-11-05T18:28:28Z) - ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety [87.90209836101353]
ConspirED is the first dataset of conspiratorial content annotated for general cognitive traits. We develop computational models that identify conspiratorial traits and determine dominant traits in text excerpts. We evaluate large language/reasoning model (LLM/LRM) robustness to conspiratorial inputs.
arXiv Detail & Related papers (2025-08-28T06:39:25Z) - TRAPDOC: Deceiving LLM Users by Injecting Imperceptible Phantom Tokens into Documents [4.753535328327316]
Over-reliance on large language models (LLMs) is emerging as a significant social issue. We propose a method that injects imperceptible phantom tokens into documents, causing LLMs to generate outputs that appear plausible to users but are in fact incorrect. Based on this technique, we introduce TRAPDOC, a framework designed to deceive over-reliant LLM users.
arXiv Detail & Related papers (2025-05-30T07:16:53Z) - Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors [65.27124213266491]
We propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors. CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by large language models. Our theoretical analysis suggests the superiority of the proposed attack.
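A minimal sketch of the contrastive idea described above, assuming a Hugging Face causal LM and illustrative prompts; CoPA's actual construction of the machine-like distribution may differ.

```python
# Sketch of contrastive next-token scoring in the spirit of CoPA (assumptions:
# the prompts, alpha, and model choice are illustrative, not the paper's).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def contrastive_next_token_logits(human_prompt: str, machine_prompt: str,
                                  alpha: float = 1.0) -> torch.Tensor:
    """Subtract a 'machine-like' next-token distribution from a 'human-like' one."""
    h = tok(human_prompt, return_tensors="pt")
    m = tok(machine_prompt, return_tensors="pt")
    with torch.no_grad():
        logits_h = lm(**h).logits[0, -1]  # next-token logits under the human-style prompt
        logits_m = lm(**m).logits[0, -1]  # next-token logits under the machine-style prompt
    log_p_h = torch.log_softmax(logits_h, dim=-1)
    log_p_m = torch.log_softmax(logits_m, dim=-1)
    # Tokens favoured by the human-like distribution but not the machine-like one.
    return log_p_h - alpha * log_p_m
```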
arXiv Detail & Related papers (2025-05-21T10:08:39Z) - Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features [5.774786149181392]
Malicious users can exploit large language models (LLMs) to produce paraphrased versions of proprietary code that closely resemble the original. We develop LPcodedec, a detection method that identifies paraphrase relationships between human-written and LLM-generated code. LPcodedec outperforms the best baselines in two tasks, improving F1 scores by 2.64% and 15.17% while achieving speedups of 1,343x and 213x, respectively.
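A hypothetical sketch of coding-style feature extraction for this kind of detection; the specific features and classifier below are illustrative assumptions, not LPcodedec's actual design.

```python
# Extract a few simple coding-style features from a code snippet; a classifier
# trained on such features could flag LLM paraphrases of human-written code.
import re

def style_features(code: str) -> list[float]:
    lines = code.splitlines() or [""]
    return [
        sum(len(l) for l in lines) / len(lines),                      # mean line length
        sum(l.strip().startswith("#") for l in lines) / len(lines),   # comment-line ratio
        float(len(re.findall(r"\b[a-z]+_[a-z_]+\b", code))),          # snake_case identifiers
        float(len(re.findall(r"\b[a-z]+[A-Z][A-Za-z]*\b", code))),    # camelCase identifiers
        float(code.count(" = ")),                                     # spaced-assignment style
    ]

# Hypothetical training step on labelled pairs (1 = LLM paraphrase, 0 = unrelated):
# from sklearn.ensemble import RandomForestClassifier
# clf = RandomForestClassifier().fit([style_features(c) for c in candidates], labels)
```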
arXiv Detail & Related papers (2025-02-25T00:58:06Z) - Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs).
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z) - Phantom: General Trigger Attacks on Retrieval Augmented Language Generation [30.63258739968483]
Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs).
We propose new attack vectors that allow an adversary to inject a single malicious document into a RAG system's knowledge base, and mount a backdoor poisoning attack.
We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama, and show that they transfer to GPT-3.5 Turbo and GPT-4.
arXiv Detail & Related papers (2024-05-30T21:19:24Z) - ConspEmoLLM: Conspiracy Theory Detection Using an Emotion-Based Large Language Model [24.650234558124442]
We propose ConspEmoLLM, the first open-source LLM that integrates affective information and is able to perform diverse tasks relating to conspiracy theories.
ConspEmoLLM is fine-tuned based on an emotion-oriented LLM using our novel ConDID dataset.
arXiv Detail & Related papers (2024-03-11T14:35:45Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' Generation [60.77990074569754]
We present a computation-efficient framework that steers a frozen Pre-Trained Language Model towards more commonsensical generation.
Specifically, we first construct a reference-free evaluator that assigns a commonsense score to a sentence.
We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head.
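A simplified sketch of the scorer-as-oracle idea; note that BOOST/NADO trains an auxiliary head rather than reranking, so the reranking loop below only illustrates how a frozen LM's outputs can be steered by an external commonsense scorer.

```python
# Sample several continuations from a frozen LM and keep the one the scorer
# prefers. The scorer here is a dummy placeholder, not the paper's evaluator.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def commonsense_score(sentence: str) -> float:
    """Placeholder reference-free scorer; a real one would be a trained model."""
    return float(-len(sentence))  # dummy heuristic, for illustration only

def rerank_generate(prompt: str, n: int = 5) -> str:
    inputs = tok(prompt, return_tensors="pt")
    outputs = lm.generate(**inputs, do_sample=True, num_return_sequences=n,
                          max_new_tokens=30, pad_token_id=tok.eos_token_id)
    candidates = [tok.decode(o, skip_special_tokens=True) for o in outputs]
    return max(candidates, key=commonsense_score)
```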
arXiv Detail & Related papers (2023-10-25T23:32:12Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
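A minimal sketch of this perturb-and-aggregate defense; the perturbation rate, number of copies, and refusal heuristic below are illustrative assumptions rather than the paper's exact configuration.

```python
# Randomly perturb several copies of a prompt at the character level, query the
# model on each copy, and aggregate the responses to detect adversarial inputs.
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)  # character-level noise
    return "".join(chars)

def smooth_llm(prompt: str, query_model, n_copies: int = 10) -> str:
    """query_model is any callable str -> str; aggregate by a refusal majority vote."""
    responses = [query_model(perturb(prompt)) for _ in range(n_copies)]
    refusals = [r for r in responses if "I cannot" in r or "I can't" in r]
    # If most perturbed copies are refused, treat the original prompt as adversarial.
    if len(refusals) > n_copies // 2:
        return "Request flagged as a likely jailbreak attempt."
    return random.choice([r for r in responses if r not in refusals] or responses)
```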
arXiv Detail & Related papers (2023-10-05T17:01:53Z) - Detecting Language Model Attacks with Perplexity [0.0]
A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses.
A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.
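A hedged sketch of the perplexity-plus-length features, assuming GPT-2 as the scoring model; the feature set and classifier configuration here are illustrative, not the paper's exact setup.

```python
# Compute prompt perplexity under GPT-2 and pair it with token length; a
# gradient-boosted classifier over these features can flag adversarial suffixes.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean negative log-likelihood
    return float(torch.exp(loss))

def features(text: str) -> list[float]:
    n_tokens = tok(text, return_tensors="pt").input_ids.shape[1]
    return [perplexity(text), float(n_tokens)]

# The paper uses LightGBM on such features; a hypothetical training call:
# from lightgbm import LGBMClassifier
# clf = LGBMClassifier().fit([features(p) for p in prompts], labels)
```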
arXiv Detail & Related papers (2023-08-27T15:20:06Z) - RADAR: Robust AI-Text Detection via Adversarial Learning [69.5883095262619]
RADAR is based on adversarial training of a paraphraser and a detector.
The paraphraser's goal is to generate realistic content to evade AI-text detection.
RADAR uses the feedback from the detector to update the paraphraser, and vice versa.
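A schematic sketch of this adversarial loop; the interfaces, reward signal, and update rules below are placeholders that illustrate the alternating training, not RADAR's actual implementation.

```python
# One round of paraphraser-vs-detector adversarial training (schematic only).
from typing import Callable, List

def adversarial_round(
    paraphrase: Callable[[str], str],                      # rewrites AI text to evade detection
    detector_score: Callable[[str], float],                # P(text is AI-generated), in [0, 1]
    detector_update: Callable[[List[str], List[str]], None],
    paraphraser_update: Callable[[List[str], List[float]], None],
    ai_texts: List[str],
    human_texts: List[str],
) -> None:
    # 1. The paraphraser rewrites AI-generated text to look more human-like.
    paraphrased = [paraphrase(t) for t in ai_texts]
    # 2. The detector is updated to separate human text from AI and paraphrased text.
    detector_update(human_texts, ai_texts + paraphrased)
    # 3. The paraphraser is rewarded when the detector misclassifies its outputs.
    rewards = [1.0 - detector_score(t) for t in paraphrased]
    paraphraser_update(paraphrased, rewards)
```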
arXiv Detail & Related papers (2023-07-07T21:13:27Z)