An Audit and Analysis of LLM-Assisted Health Misinformation Jailbreaks Against LLMs
- URL: http://arxiv.org/abs/2508.10010v1
- Date: Wed, 06 Aug 2025 02:14:28 GMT
- Title: An Audit and Analysis of LLM-Assisted Health Misinformation Jailbreaks Against LLMs
- Authors: Ayana Hussain, Patrick Zhao, Nicholas Vincent
- Abstract summary: Large Language Models (LLMs) are capable of generating harmful misinformation -- inadvertently, or when prompted by "jailbreak" attacks that attempt to produce malicious outputs. This paper investigates the efficacy and characteristics of LLM-produced jailbreak attacks that cause other models to produce harmful medical misinformation. We also study how misinformation generated by jailbroken LLMs compares to typical misinformation found on social media.
- Score: 5.0015751459745825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are a double-edged sword capable of generating harmful misinformation -- inadvertently, or when prompted by "jailbreak" attacks that attempt to produce malicious outputs. LLMs could, with additional research, be used to detect and prevent the spread of misinformation. In this paper, we investigate the efficacy and characteristics of LLM-produced jailbreak attacks that cause other models to produce harmful medical misinformation. We also study how misinformation generated by jailbroken LLMs compares to typical misinformation found on social media, and how effectively it can be detected using standard machine learning approaches. Specifically, we closely examine 109 distinct attacks against three target LLMs and compare the attack prompts to in-the-wild health-related LLM queries. We also examine the resulting jailbreak responses, comparing the generated misinformation to health-related misinformation on Reddit. Our findings add more evidence that LLMs can be effectively used to detect misinformation from both other LLMs and from people, and support a body of work suggesting that with careful design, LLMs can contribute to a healthier overall information ecosystem.
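The abstract's detection angle ("standard machine learning approaches") can be pictured with a minimal sketch along the following lines. The toy examples, the TF-IDF features, and the logistic-regression classifier are illustrative assumptions, not the paper's reported pipeline.

```python
# Minimal sketch: a baseline text classifier for flagging health misinformation.
# Hypothetical data and model choice; not the authors' exact method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples (1 = misinformation, 0 = reliable).
texts = [
    "Colloidal silver cures viral infections within hours.",      # 1
    "Vaccines contain microchips that track your location.",      # 1
    "Clinical trials found the vaccine reduced severe illness.",  # 0
    "Regular handwashing lowers the risk of infection.",          # 0
]
labels = [1, 1, 0, 0]

# Word uni- and bigram TF-IDF features are a common baseline for short health texts.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# Score a new claim; a higher probability means "more likely misinformation".
claim = "Doctors admit that drinking bleach flushes out the virus."
print(clf.predict_proba([claim])[0, 1])
```

In practice such a classifier would be trained on a much larger corpus (e.g., labeled social media posts and jailbreak outputs) before its scores are meaningful.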
Related papers
- Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation [66.84286617519258]
Large language models are transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. Variation across models and prompts in these annotation pipelines can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors. We find that intentional LLM hacking is strikingly simple: by replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant.
arXiv Detail & Related papers (2025-09-10T17:58:53Z) - How does Misinformation Affect Large Language Model Behaviors and Preferences? [37.06385727015972]
Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks. We present MisBench, the current largest and most comprehensive benchmark for evaluating LLMs' behavior and knowledge preference toward misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they still remain susceptible to knowledge conflicts and stylistic variations.
arXiv Detail & Related papers (2025-05-27T17:57:44Z) - "I know myself better, but not really greatly": How Well Can LLMs Detect and Explain LLM-Generated Texts? [10.454446545249096]
This paper investigates the detection and explanation capabilities of current LLMs across two settings: binary (human vs. LLM-generated) and ternary classification (including an "undecided" class). We evaluate 6 closed- and open-source LLMs of varying sizes and find that self-detection (LLMs identifying their own outputs) consistently outperforms cross-detection (identifying outputs from other LLMs). Our findings underscore the limitations of current LLMs in self-detection and self-explanation, highlighting the need for further research to address overfitting and enhance generalizability.
arXiv Detail & Related papers (2025-02-18T11:00:28Z) - Can Editing LLMs Inject Harm? [122.83469484328465]
We propose to reformulate knowledge editing as a new type of safety threat for Large Language Models.
For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection.
For the risk of bias injection, we discover not only that biased sentences can be injected into LLMs with high effectiveness, but also that injecting a single biased sentence can cause a broader increase in bias.
arXiv Detail & Related papers (2024-07-29T17:58:06Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks [55.603893267803265]
Large Language Models (LLMs) are susceptible to Jailbreaking attacks.
Jailbreaking attacks aim to extract harmful information by subtly modifying the attack query.
We focus on a new attack form, called Contextual Interaction Attack.
arXiv Detail & Related papers (2024-02-14T13:45:19Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Combating Misinformation in the Age of LLMs: Opportunities and Challenges [21.712051537924136]
The emergence of Large Language Models (LLMs) has great potential to reshape the landscape of combating misinformation.
On the one hand, LLMs bring promising opportunities for combating misinformation due to their profound world knowledge and strong reasoning abilities.
On the other hand, the critical challenge is that LLMs can be easily leveraged to generate deceptive misinformation at scale.
arXiv Detail & Related papers (2023-11-09T00:05:27Z) - Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong [35.64962031447787]
Large Language Models (LLMs) are increasingly used for accessing information on the web.
Our experiments with 80 crowdworkers compare language models with search engines (information retrieval systems) at facilitating fact-checking.
Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy.
arXiv Detail & Related papers (2023-10-19T08:09:58Z) - Can LLM-Generated Misinformation Be Detected? [18.378744138365537]
Large Language Models (LLMs) can be exploited to generate misinformation.
A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation?
arXiv Detail & Related papers (2023-09-25T00:45:07Z) - Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
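Strategy (1) can be approximated with a rough sketch like the one below. It uses a simple, context-free WordNet lookup; the replacement probability, the example sentence, and the use of NLTK are all assumptions for illustration, not that paper's context-aware attack implementation.

```python
# Rough sketch of a synonym-substitution perturbation used to stress-test
# LLM-text detectors. Context-free WordNet swaps; illustrative only.
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

def synonym_attack(text: str, replace_prob: float = 0.3, seed: int = 0) -> str:
    """Randomly swap words for single-word WordNet synonyms."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        # Collect single-word synonyms that differ from the original token.
        candidates = {
            lemma.name()
            for syn in wordnet.synsets(word)
            for lemma in syn.lemmas()
            if "_" not in lemma.name() and lemma.name().lower() != word.lower()
        }
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(sorted(candidates)))
        else:
            out.append(word)
    return " ".join(out)

print(synonym_attack("The vaccine is completely safe and highly effective."))
```

A context-aware variant, as described in the paper, would additionally filter candidate synonyms by how well they fit the surrounding sentence (e.g., via a language model score) rather than sampling them uniformly.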
arXiv Detail & Related papers (2023-05-31T10:08:37Z) - On the Risk of Misinformation Pollution with Large Language Models [127.1107824751703]
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation.
Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of Open-Domain Question Answering (ODQA) systems.
arXiv Detail & Related papers (2023-05-23T04:10:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.