Related papers: On Adversarial Robustness and Out-of-Distribution Robustness of Large Language Models


URL: http://arxiv.org/abs/2412.10535v1
Date: Fri, 13 Dec 2024 20:04:25 GMT
Title: On Adversarial Robustness and Out-of-Distribution Robustness of Large Language Models
Authors: April Yang, Jordan Tab, Parth Shah, Paul Kotchavong
Abstract summary: We investigate the correlation between adversarial robustness and OOD robustness in large language models (LLMs). Our findings highlight nuanced interactions between adversarial robustness and OOD robustness, with results indicating limited transferability. Further research is needed to evaluate these interactions across larger models and varied architectures.
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The increasing reliance on large language models (LLMs) for diverse applications necessitates a thorough understanding of their robustness to adversarial perturbations and out-of-distribution (OOD) inputs. In this study, we investigate the correlation between adversarial robustness and OOD robustness in LLMs, addressing a critical gap in robustness evaluation. By applying methods originally designed to improve one robustness type across both contexts, we analyze their performance on adversarial and out-of-distribution benchmark datasets. The input of the model consists of text samples, with the output prediction evaluated in terms of accuracy, precision, recall, and F1 scores in various natural language inference tasks. Our findings highlight nuanced interactions between adversarial robustness and OOD robustness, with results indicating limited transferability between the two robustness types. Through targeted ablations, we evaluate how these correlations evolve with different model sizes and architectures, uncovering model-specific trends: smaller models like LLaMA2-7b exhibit neutral correlations, larger models like LLaMA2-13b show negative correlations, and Mixtral demonstrates positive correlations, potentially due to domain-specific alignment. These results underscore the importance of hybrid robustness frameworks that integrate adversarial and OOD strategies tailored to specific models and domains. Further research is needed to evaluate these interactions across larger models and varied architectures, offering a pathway to more reliable and generalizable LLMs.
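The abstract scores predictions with accuracy, precision, recall, and F1, then relates adversarial and OOD robustness through their correlation. The sketch below illustrates those computations under stated assumptions: macro-averaging over labels and Pearson correlation across methods. The abstract does not specify either choice, and `classification_metrics` and `pearson` are hypothetical helper names, not the authors' code.

```python
from statistics import mean


def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (hypothetical helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        # Per-label counts, treating `label` as the positive class.
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    # Macro-averaging (an assumption): every label counts equally, regardless of frequency.
    return {
        "accuracy": accuracy,
        "precision": mean(precisions),
        "recall": mean(recalls),
        "f1": mean(f1s),
    }


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy NLI predictions, made up for illustration (not results from the paper).
    gold = ["entailment", "neutral", "contradiction", "entailment"]
    pred = ["entailment", "contradiction", "contradiction", "entailment"]
    print(classification_metrics(gold, pred))

    # Toy per-method F1 changes on an adversarial and an OOD benchmark.
    adv_delta = [0.02, 0.05, -0.01, 0.03]
    ood_delta = [0.01, 0.04, 0.00, 0.02]
    print(pearson(adv_delta, ood_delta))
```

Under this convention, one way to read the coefficient is as follows. A positive value suggests that gains transfer between the two robustness types. A value near zero suggests a neutral relationship. A negative value suggests a trade-off. These correspond to the three patterns the abstract reports for Mixtral, LLaMA2-7b, and LLaMA2-13b.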

Related papers

Harnessing Consistency for Robust Test-Time LLM Ensemble
CoRE is a plug-and-play technique that harnesses model consistency for robust LLM ensemble. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens. Model-level consistency models global agreement by promoting model outputs with high self-confidence.
arXiv (2025-10-12)
Towards Robust LLMs: an Adversarial Robustness Measurement Framework
Large Language Models (LLMs) remain vulnerable to adversarial perturbations, undermining their reliability in high-stakes applications. We adapt the Robustness Measurement and Assessment framework to quantify LLM resilience against adversarial inputs without requiring access to model parameters. Our work provides a systematic methodology to assess LLM robustness, advancing the development of more reliable language models for real-world deployment.
arXiv (2025-04-24)
SALAD: Improving Robustness and Generalization through Contrastive Learning with Structure-Aware and LLM-Driven Augmented Data
We propose SALAD, a novel approach to enhance model robustness and generalization. Our method generates structure-aware and counterfactually augmented data for contrastive learning. We validate our approach through experiments on three tasks: Sentiment Classification, Sexism Detection, and Natural Language Inference.
arXiv (2025-04-16)
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents. Recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses. This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation.
arXiv (2025-04-10)
Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions
This paper introduces specialized metrics for benchmarking the spatial robustness of segmentation models. We propose region-aware multi-attack adversarial analysis, a method that enables a deeper understanding of model robustness. The results reveal that models respond to these two types of threats differently.
arXiv (2025-04-02)
Alignment and Adversarial Robustness: Are More Human-Like Models More Secure?
We conduct a large-scale empirical analysis to investigate the relationship between representational alignment and adversarial robustness. Our findings reveal that while average alignment and robustness exhibit a weak overall correlation, specific alignment benchmarks serve as strong predictors of adversarial robustness.
arXiv (2025-02-17)
On Adversarial Robustness of Language Models in Transfer Learning
We show that transfer learning, while improving standard performance metrics, often leads to increased vulnerability to adversarial attacks. Our findings demonstrate that larger models exhibit greater resilience to this phenomenon, suggesting a complex interplay between model size, architecture, and adaptation methods.
arXiv (2024-12-29)
Bridging Interpretability and Robustness Using LIME-Guided Model Refinement
LIME-guided model refinement uses Local Interpretable Model-Agnostic Explanations (LIME) to systematically enhance model robustness. Empirical evaluations on multiple benchmark datasets demonstrate that LIME-guided refinement not only improves interpretability but also significantly enhances resistance to adversarial perturbations and generalization to out-of-distribution data.
arXiv (2024-12-25)
Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models
We explore the collaborative dynamics of an innovative language model interaction system involving advanced models. These models generate and answer complex, PhD-level statistical questions without exact ground-truth answers. Our study investigates how inter-model consensus enhances the reliability and precision of responses.
arXiv (2024-11-25)
The BRAVO Semantic Segmentation Challenge Results in UNCV2024
We define two categories of reliability: (1) semantic reliability, which reflects the model's accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model's ability to detect object classes that are unknown during training. The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.
arXiv (2024-09-23)
JAB: Joint Adversarial Prompting and Belief Augmentation
We introduce a joint framework in which we probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation. This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes.
arXiv (2023-11-16)
On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training
Aspect-based sentiment analysis (ABSA) aims to automatically infer the sentiment polarities toward specific aspects of products or services expressed in social media texts or reviews. We propose to enhance ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training.
arXiv (2023-04-19)
Fairness Increases Adversarial Vulnerability
This paper shows the existence of a dichotomy between fairness and robustness, and analyzes when achieving fairness decreases the model robustness to adversarial samples. Experiments on non-linear models and different architectures validate the theoretical findings in multiple vision domains. The paper proposes a simple, yet effective, solution to construct models achieving good tradeoffs between fairness and robustness.
arXiv (2022-11-21)
Improving Adversarial Robustness via Mutual Information Estimation
Deep neural networks (DNNs) are found to be vulnerable to adversarial noise. In this paper, we investigate the dependence between outputs of the target model and input adversarial samples from the perspective of information theory. We propose to enhance the adversarial robustness by maximizing the natural MI and minimizing the adversarial MI during the training process.
arXiv (2022-07-25)
Models Out of Line: A Fourier Lens on Distribution Shift Robustness
Improving the accuracy of deep neural networks (DNNs) on out-of-distribution (OOD) data is critical to the acceptance of deep learning (DL) in real-world applications. Recently, some promising approaches have been developed to improve OOD robustness. However, there is still no clear understanding of the conditions on OOD data and model properties that are required to observe effective robustness.
arXiv (2022-07-08)
Adversarially Robust Estimate and Risk Analysis in Linear Regression
Adversarially robust learning aims to design algorithms that are robust to small adversarial perturbations on input variables. By deriving the statistical minimax rate of convergence of adversarially robust estimators, we emphasize the importance of incorporating model information. We propose a straightforward two-stage adversarial learning framework, which makes use of model structure information to improve adversarial robustness.
arXiv (2020-12-18)
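As a worked illustration of what robustness to small input perturbations means for linear regression, the identity below assumes squared loss, a nonzero coefficient vector, and perturbations bounded in the ℓ2 norm by ε. It is a standard consequence of the Cauchy-Schwarz inequality and is not necessarily the exact setting or estimator used in the paper above.

```latex
\sup_{\|\delta\|_2 \le \varepsilon} \bigl(y - (x+\delta)^\top \beta\bigr)^2
  = \bigl(|y - x^\top \beta| + \varepsilon \|\beta\|_2\bigr)^2,
\qquad
\delta^\star = -\varepsilon \,\operatorname{sign}\bigl(y - x^\top \beta\bigr)\,\frac{\beta}{\|\beta\|_2}.
```

Under this assumption, the worst-case loss is the clean residual inflated by ε‖β‖₂. Estimators with a large coefficient norm are therefore penalized more heavily as the perturbation budget grows.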

This list is automatically generated from the titles and abstracts of the papers on this site.