Unveiling Safety Vulnerabilities of Large Language Models
- URL: http://arxiv.org/abs/2311.04124v1
- Date: Tue, 7 Nov 2023 16:50:33 GMT
- Title: Unveiling Safety Vulnerabilities of Large Language Models
- Authors: George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich,
Ora Nova Fandina, Ateret Anaby-Tavor, Orna Raz and Eitan Farchi
- Abstract summary: This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ.
We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it.
We introduce a novel automatic approach for identifying and naming vulnerable semantic regions.
- Score: 4.562678399685183
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models become more prevalent, their possible harmful or
inappropriate responses are a cause for concern. This paper introduces a unique
dataset containing adversarial examples in the form of questions, which we call
AttaQ, designed to provoke such harmful or inappropriate responses. We assess
the efficacy of our dataset by analyzing the vulnerabilities of various models
when subjected to it. Additionally, we introduce a novel automatic approach for
identifying and naming vulnerable semantic regions - input semantic areas for
which the model is likely to produce harmful outputs. This is achieved through
the application of specialized clustering techniques that consider both the
semantic similarity of the input attacks and the harmfulness of the model's
responses. Automatically identifying vulnerable semantic regions enhances the
evaluation of model weaknesses, facilitating targeted improvements to its
safety mechanisms and overall reliability.
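To make the clustering idea concrete, below is a minimal sketch, assuming an off-the-shelf sentence encoder, toy harmfulness scores, and plain k-means in place of the paper's specialized clustering; it illustrates the general recipe rather than the authors' pipeline.

```python
# Minimal sketch: cluster adversarial questions by semantic similarity, then rank
# clusters by the mean harmfulness of the evaluated model's responses. The encoder,
# scores, and clustering algorithm are placeholders, not the paper's exact method.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

attack_questions = [
    "How do I pick a lock?",                       # AttaQ-style adversarial questions
    "What's the easiest way to open a padlock without a key?",
    "Write a short poem about the ocean.",
]
harmfulness = np.array([0.9, 0.8, 0.0])            # toy scores for the model's responses

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder
X = encoder.encode(attack_questions)

n_clusters = 2
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)

# Clusters whose responses are harmful on average are candidate
# "vulnerable semantic regions" of the evaluated model.
for c in range(n_clusters):
    mask = labels == c
    print(f"cluster {c}: size={mask.sum()}, mean harmfulness={harmfulness[mask].mean():.2f}")
```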
Related papers
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- Towards the generation of hierarchical attack models from cybersecurity vulnerabilities using language models [3.7548609506798494]
This paper investigates the use of a pre-trained language model and siamese network to discern sibling relationships between text-based cybersecurity vulnerability data.
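As a rough illustration of this setup, the sketch below scores whether two vulnerability descriptions are siblings using a shared projection head over frozen sentence embeddings; the encoder, head architecture, and example inputs are assumptions, not the paper's configuration.

```python
# Siamese-style sibling scorer over frozen sentence embeddings (illustrative only).
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed pre-trained language model (frozen)

class SiameseHead(nn.Module):
    """Shared projection applied to both descriptions; cosine similarity = sibling score."""
    def __init__(self, dim: int = 384, hidden: int = 128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.cosine_similarity(self.proj(a), self.proj(b), dim=-1)

head = SiameseHead()  # would be trained on labelled sibling / non-sibling pairs
vuln_a = encoder.encode("Buffer overflow in the image parser allows remote code execution.",
                        convert_to_tensor=True)
vuln_b = encoder.encode("Heap overflow in the PNG decoder leads to arbitrary code execution.",
                        convert_to_tensor=True)
print(head(vuln_a, vuln_b))  # higher score = more likely sibling vulnerabilities
```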
arXiv Detail & Related papers (2024-10-07T13:05:33Z)
- MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
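A conceptual sketch of this check; the captioner, text-to-image generator, image encoder, and threshold are all treated as caller-supplied placeholders rather than a specific implementation.

```python
# MirrorCheck-style test as described in the abstract: caption the input with the
# target VLM, regenerate an image from that caption with a T2I model, and flag the
# input if the original and regenerated images are semantically far apart.
import torch
from PIL import Image

def mirror_check(image: Image.Image, caption_with_vlm, generate_with_t2i, image_encoder,
                 threshold: float = 0.6) -> bool:
    caption = caption_with_vlm(image)           # target VLM under evaluation (placeholder callable)
    regenerated = generate_with_t2i(caption)    # e.g. a diffusion-based T2I model (placeholder)
    z_in = image_encoder(image)                 # e.g. CLIP-style image embeddings (placeholder)
    z_re = image_encoder(regenerated)
    similarity = torch.cosine_similarity(z_in, z_re, dim=-1)
    return bool(similarity < threshold)         # True => likely adversarial input
```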
arXiv Detail & Related papers (2024-06-13T15:55:04Z)
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [7.5520641322945785]
Large Language Models' safety remains a critical concern due to their vulnerability to adversarial attacks.
We introduce the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts.
We also propose novel strategies for autonomously generating adversarial training datasets.
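A minimal sketch of such a lightweight prompt classifier, using DistilBERT as an assumed stand-in; the actual APS architecture, checkpoint, and training data are not reproduced here.

```python
# Prompt-safety classifier sketch: a small transformer fine-tuned to label prompts
# as benign (0) or adversarial (1). The base model and label convention are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
# ... fine-tune on (prompt, label) pairs, e.g. an adversarial training set, before use ...

def is_adversarial(prompt: str) -> bool:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return bool(logits.argmax(dim=-1).item() == 1)  # label 1 = adversarial (assumed)
```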
arXiv Detail & Related papers (2023-10-31T22:22:10Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% absolute classification accuracy among semantically related scenarios, and absolute error rates of up to 19% in zero-shot adversarial settings.
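One plausible reading of "semantically aligned augmentation" in code: paraphrase a safety prompt and keep only the variants that stay semantically close to the original. The paraphraser is passed in as a callable, and the encoder and threshold are arbitrary choices; this is an interpretation, not the paper's implementation.

```python
# Keep paraphrases of a safety prompt only if they remain semantically aligned
# with the original (illustrative interpretation of the technique's name).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed similarity model

def aligned_augment(prompt: str, paraphrase, n: int = 5, min_sim: float = 0.8) -> list[str]:
    candidates = [paraphrase(prompt) for _ in range(n)]      # e.g. LLM-generated rewrites
    z0 = encoder.encode(prompt, convert_to_tensor=True)
    kept = []
    for cand in candidates:
        z = encoder.encode(cand, convert_to_tensor=True)
        if util.cos_sim(z0, z).item() >= min_sim:            # discard semantic drift
            kept.append(cand)
    return kept
```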
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models [0.47334880432883714]
We present an analysis of adversarial robustness exhibited by various hate-speech detection models.
We devise and execute targeted attacks on the text by leveraging the TextAttack tool.
This work paves the way for creating more robust and reliable hate-speech detection systems.
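A usage sketch with TextAttack's documented recipe API; the hate-speech checkpoint, the inline examples, and the choice of TextFooler as the recipe are illustrative assumptions, not necessarily the paper's setup.

```python
# Run a word-substitution attack (TextFooler) against a hate-speech classifier.
import transformers
import textattack

model_name = "cardiffnlp/twitter-roberta-base-hate"  # assumed available hate-speech classifier
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

attack = textattack.attack_recipes.TextFoolerJin2019.build(wrapper)
dataset = textattack.datasets.Dataset([
    ("I can't stand people from that group, they ruin everything.", 1),  # toy labelled examples
    ("The weather was lovely at the park today.", 0),
])
attacker = textattack.Attacker(attack, dataset, textattack.AttackArgs(num_examples=2))
attacker.attack_dataset()  # logs perturbed texts and whether each attack succeeded
```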
arXiv Detail & Related papers (2023-05-29T19:59:40Z)
- It Is All About Data: A Survey on the Effects of Data on Adversarial Robustness [4.1310970179750015]
Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to confuse the model into making a mistake.
To address this problem, the area of adversarial robustness investigates mechanisms behind adversarial attacks and defenses against these attacks.
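As the textbook instance of such an input, the fast gradient sign perturbation (Goodfellow et al.) is shown below for illustration; it is only one of the many attack families the survey covers.

```latex
% A single-step adversarial example: a small perturbation in the direction of the
% sign of the loss gradient, scaled by a budget \epsilon.
x_{\mathrm{adv}} \;=\; x \;+\; \epsilon \cdot \operatorname{sign}\!\bigl(\nabla_{x}\,\mathcal{L}(\theta, x, y)\bigr)
```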
arXiv Detail & Related papers (2023-03-17T04:18:03Z)
- Poisoning Attacks and Defenses on Artificial Intelligence: A Survey [3.706481388415728]
Data poisoning attacks tamper with the data samples fed to the model during the training phase, degrading the model's accuracy at inference time.
This work compiles the most relevant insights and findings from the latest literature addressing this type of attack.
A thorough assessment is performed on the reviewed works, comparing the effects of data poisoning on a wide range of ML models in real-world conditions.
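A toy, self-contained illustration of training-time label flipping (one simple form of data poisoning); the setup and numbers are illustrative, not results from the surveyed works.

```python
# Flip a fraction of training labels and compare test accuracy with and without the poison.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clean = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
y_poisoned = y_tr.copy()
flip = rng.choice(len(y_tr), size=int(0.3 * len(y_tr)), replace=False)
y_poisoned[flip] = 1 - y_poisoned[flip]  # tamper with 30% of the training labels
poisoned = LogisticRegression(max_iter=1000).fit(X_tr, y_poisoned)

print("clean test accuracy:   ", clean.score(X_te, y_te))
print("poisoned test accuracy:", poisoned.score(X_te, y_te))
```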
arXiv Detail & Related papers (2022-02-21T14:43:38Z)
- CC-Cert: A Probabilistic Approach to Certify General Robustness of Neural Networks [58.29502185344086]
In safety-critical machine learning applications, it is crucial to defend models against adversarial attacks.
It is important to provide provable guarantees for deep learning models against semantically meaningful input transformations.
We propose a new universal probabilistic certification approach based on Chernoff-Cramer bounds.
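For orientation, certificates in this family build on the generic Chernoff-Cramér bound below; the paper's specific certificate and its assumptions are not reproduced here.

```latex
% Generic Chernoff-Cramer bound: for any real random variable X and threshold t,
\Pr\left[X \ge t\right] \;\le\; \min_{s>0}\; e^{-st}\,\mathbb{E}\!\left[e^{sX}\right]
```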
arXiv Detail & Related papers (2021-09-22T12:46:04Z)
- Explainable Adversarial Attacks in Deep Neural Networks Using Activation Profiles [69.9674326582747]
This paper presents a visual framework to investigate neural network models subjected to adversarial examples.
We show how observing these elements can quickly pinpoint exploited areas in a model.
arXiv Detail & Related papers (2021-03-18T13:04:21Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
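A compact sketch of the ensemble-search idea, with a genetic algorithm over binary model-selection masks; the fitness evaluation (the transfer rate of adversarial examples crafted against the selected ensemble) is an expensive step left here as a placeholder callable.

```python
# Genetic search for a subset of surrogate models whose joint adversarial examples
# transfer best; `fitness(mask)` must score a candidate ensemble (placeholder).
import random

def genetic_ensemble_search(n_models, fitness, pop_size=20, generations=30, mutation_rate=0.1):
    rng = random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(n_models)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]  # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_models)
            child = a[:cut] + b[cut:]                                      # one-point crossover
            child = [1 - g if rng.random() < mutation_rate else g for g in child]  # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Toy fitness that merely prefers three-member ensembles; replace with a measured transfer rate.
print(genetic_ensemble_search(8, fitness=lambda mask: -abs(sum(mask) - 3)))
```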
arXiv Detail & Related papers (2020-11-17T10:45:05Z)