Related papers: Fooling LLM graders into giving better grades through neural activity guided adversarial prompting

URL: http://arxiv.org/abs/2412.15275v1
Date: Tue, 17 Dec 2024 19:08:22 GMT
Title: Fooling LLM graders into giving better grades through neural activity guided adversarial prompting
Authors: Atsushi Yamamura, Surya Ganguli
Abstract summary: We propose a systematic method to reveal exploitable biases in AI evaluation systems. Our approach first identifies hidden neural activity patterns that predict distorted decision outcomes. We demonstrate that this combination can effectively fool large language model graders into assigning much higher grades than humans would.
Score: 26.164839501935973
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The deployment of artificial intelligence (AI) in critical decision-making and evaluation processes raises concerns about inherent biases that malicious actors could exploit to distort decision outcomes. We propose a systematic method to reveal such biases in AI evaluation systems and apply it to automated essay grading as an example. Our approach first identifies hidden neural activity patterns that predict distorted decision outcomes and then optimizes an adversarial input suffix to amplify such patterns. We demonstrate that this combination can effectively fool large language model (LLM) graders into assigning much higher grades than humans would. We further show that this white-box attack transfers to black-box attacks on other models, including commercial closed-source models like Gemini. We also reveal the existence of a "magic word" that plays a pivotal role in the efficacy of the attack. We trace the origin of this magic word bias to the structure of commonly-used chat templates for supervised fine-tuning of LLMs and show that a minor change in the template can drastically reduce the bias. This work not only uncovers vulnerabilities in current LLMs but also proposes a systematic method to identify and remove hidden biases, contributing to the goal of ensuring AI safety and security.
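
Illustrative sketch (not the authors' code): the two-stage idea in the abstract, finding an activity direction that predicts inflated grades and then searching for a suffix that amplifies it, can be mimicked on a toy linear "model". Everything here is a hypothetical stand-in: the vocabulary, the random embeddings in `EMBED`, the mean-embedding `activation`, the linear `toy_grade`, and the greedy `optimize_suffix` search. A real LLM is nonlinear, and the paper's actual probing and optimization procedures are not described on this page.

```python
import random

# Toy stand-ins (assumptions for illustration only, not a real LLM).
DIM = 8
VOCAB = [f"tok{i}" for i in range(50)]
_rng = random.Random(0)
EMBED = {t: [_rng.gauss(0.0, 1.0) for _ in range(DIM)] for t in VOCAB}
# Hidden direction that the toy grader happens to be sensitive to.
SECRET = [_rng.gauss(0.0, 1.0) for _ in range(DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def activation(tokens):
    """Toy hidden state: the mean embedding of the tokens."""
    n = len(tokens)
    return [sum(EMBED[t][d] for t in tokens) / n for d in range(DIM)]


def toy_grade(tokens):
    """Toy grader: a linear readout of the hidden state."""
    return dot(activation(tokens), SECRET)


def find_direction(samples):
    """Stage 1: mean activation of the top-graded half minus the bottom half."""
    ranked = sorted(samples, key=toy_grade)
    half = len(ranked) // 2
    low = [activation(s) for s in ranked[:half]]
    high = [activation(s) for s in ranked[-half:]]
    return [
        sum(h[d] for h in high) / half - sum(l[d] for l in low) / half
        for d in range(DIM)
    ]


def optimize_suffix(essay, direction, length=3):
    """Stage 2: greedily append the token that most increases the projection
    of the hidden state onto the direction found in stage 1."""
    suffix = []
    for _ in range(length):
        best = max(
            VOCAB,
            key=lambda t: dot(activation(essay + suffix + [t]), direction),
        )
        suffix.append(best)
    return suffix


# Random "essays" of 20 tokens each serve as probing data.
samples = [[_rng.choice(VOCAB) for _ in range(20)] for _ in range(200)]
direction = find_direction(samples)
essay = samples[0]
suffix = optimize_suffix(essay, direction)
before = dot(activation(essay), direction)
after = dot(activation(essay + suffix), direction)
```

Comparing `toy_grade(essay)` with `toy_grade(essay + suffix)` shows whether amplifying the estimated direction also moves the toy grade. Because this toy is linear, the greedy search selects the same token at every position; in a real model the search would need gradient or activation guidance at each step.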

Related papers

Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models [15.134149399922192]
We propose a novel jailbreaking method inspired by the Chain-of-Thought mechanism. The attacker model uses mission transfer to conceal harmful user intent in dialogue and generates chained narrative lures to stimulate the reasoning capabilities of victim models. Our experiments reveal that models with weaker safety mechanisms exhibit stronger attack capabilities, demonstrating that models can not only be exploited, but also help harm others.
arXiv Detail & Related papers (2025-05-23T06:19:05Z)
Feature-Aware Malicious Output Detection and Mitigation [8.378272216429954]
We propose a feature-aware method for harmful response rejection (FMM). FMM detects the presence of malicious features within the model's feature space and adaptively adjusts the model's rejection mechanism. Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques.
arXiv Detail & Related papers (2025-04-12T12:12:51Z)
Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks. We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction. We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z)
Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats. This paper presents an innovative defensive strategy, given white box access to an LLM. We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Models (LLMs) rather than by the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping. We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks. This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs. We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
Defense Against Model Extraction Attacks on Recommender Systems [53.127820987326295]
We introduce Gradient-based Ranking Optimization (GRO) to defend against model extraction attacks on recommender systems. GRO aims to minimize the loss of the protected target model while maximizing the loss of the attacker's surrogate model. Results show GRO's superior effectiveness in defending against model extraction attacks.
arXiv Detail & Related papers (2023-10-25T03:30:42Z)
Open Sesame! Universal Black Box Jailbreaking of Large Language Models [0.0]
Large language models (LLMs) are designed to provide helpful and safe responses. LLMs often rely on alignment techniques to align with user intent and social guidelines. We introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible.
arXiv Detail & Related papers (2023-09-04T08:54:20Z)
Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models [48.93128542994217]
We propose an imitation adversarial attack on black-box neural passage ranking models. We show that the target passage ranking model can be transparentized and imitated by enumerating critical queries/candidates. We also propose an innovative gradient-based attack method, empowered by the pairwise objective function, to generate adversarial triggers.
arXiv Detail & Related papers (2022-09-14T09:10:07Z)
RamBoAttack: A Robust Query Efficient Deep Neural Network Decision Exploit [9.93052896330371]
We develop a robust query efficient attack capable of avoiding entrapment in a local minimum and misdirection from noisy gradients. The RamBoAttack is more robust to the different sample inputs available to an adversary and the targeted class.
arXiv Detail & Related papers (2021-12-10T01:25:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.