Related papers: Benchmarking Gaslighting Attacks Against Speech Large Language Models

Benchmarking Gaslighting Attacks Against Speech Large Language Models

URL: http://arxiv.org/abs/2509.19858v1
Date: Wed, 24 Sep 2025 07:57:10 GMT
Title: Benchmarking Gaslighting Attacks Against Speech Large Language Models
Authors: Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, Pan Zhou,
Abstract summary: We introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning.<n>Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation.<n>Our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals.
Score: 31.842578503471586
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.

Related papers

Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations [94.62792643569567]
This work systematically investigates the role of speaker emotion.<n>We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs.<n>Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk.
arXiv Detail & Related papers (2025-10-19T15:41:25Z)
Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare.<n>Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns.<n>We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
Backdoor Attacks Against Speech Language Models [63.07317091368079]
We present the first systematic study of audio backdoor attacks against speech language models.<n>We demonstrate its effectiveness across four speech encoders and three datasets, covering four tasks.<n>We propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.
arXiv Detail & Related papers (2025-10-01T17:45:04Z)
Adversarial Attack Classification and Robustness Testing for Large Language Models for Code [19.47426054151291]
This study investigates how adversarial perturbations in natural language inputs affect Large Language Models for Code (LLM4Code)<n>It examines the effects of perturbations at the character, word, and sentence levels to identify the most impactful vulnerabilities.
arXiv Detail & Related papers (2025-06-09T17:02:29Z)
Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study [5.843063647136238]
This study systematically evaluates five leading LALMs across four attack scenarios.<n>No single model consistently outperforms others across all attack types.<n>A negative correlation between instruction-following capability and robustness suggests models adhering strictly to instructions may be more susceptible.
arXiv Detail & Related papers (2025-05-26T07:08:38Z)
Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [50.40122190627256]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models [60.72029578488467]
Adrial audio attacks pose a significant threat to the growing use of large audio-language models (LALMs) in human-machine interactions.<n>We introduce the Chat-Audio Attacks benchmark including four distinct types of audio attacks.<n>We evaluate six state-of-the-art LALMs with voice interaction capabilities, including Gemini-1.5-Pro, GPT-4o, and others.
arXiv Detail & Related papers (2024-11-22T10:30:48Z)
Assessing Adversarial Robustness of Large Language Models: An Empirical Study [24.271839264950387]
Large Language Models (LLMs) have revolutionized natural language processing, but their robustness against adversarial attacks remains a critical concern. We present a novel white-box style attack approach that exposes vulnerabilities in leading open-source LLMs, including Llama, OPT, and T5.
arXiv Detail & Related papers (2024-05-04T22:00:28Z)
TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning [63.481446315733145]
Cross-lingual backdoor attacks against multilingual large language models (LLMs) are under-explored.<n>Our research focuses on how poisoning the instruction-tuning data for one or two languages can affect the outputs for languages whose instruction-tuning data were not poisoned.<n>Our method exhibits remarkable efficacy in models like mT5 and GPT-4o, with high attack success rates, surpassing 90% in more than 7 out of 12 languages.
arXiv Detail & Related papers (2024-04-30T14:43:57Z)
SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios. We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models [0.47334880432883714]
We present an analysis of adversarial robustness exhibited by various hate-speech detection models. We devise and execute targeted attacks on the text by leveraging the TextAttack tool. This work paves the way for creating more robust and reliable hate-speech detection systems.
arXiv Detail & Related papers (2023-05-29T19:59:40Z)
Characterizing the adversarial vulnerability of speech self-supervised learning [95.03389072594243]
We make the first attempt to investigate the adversarial vulnerability of such paradigm under the attacks from both zero-knowledge adversaries and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries.
arXiv Detail & Related papers (2021-11-08T08:44:04Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.