Certified Robustness for Large Language Models with Self-Denoising
- URL: http://arxiv.org/abs/2307.07171v1
- Date: Fri, 14 Jul 2023 05:40:24 GMT
- Title: Certified Robustness for Large Language Models with Self-Denoising
- Authors: Zhen Zhang, Guanhua Zhang, Bairu Hou, Wenqi Fan, Qing Li, Sijia Liu,
Yang Zhang, Shiyu Chang
- Abstract summary: We propose to denoise the corrupted inputs with large language models (LLMs) in a self-denoising manner.
Our method outperforms existing certification methods in terms of both certified and empirical robustness.
- Score: 42.916661225753145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although large language models (LLMs) have achieved great success in a vast
range of real-world applications, their vulnerability to noisy inputs has
significantly limited their use, especially in high-stakes environments. In
these contexts, it is crucial to ensure that every prediction made by large
language models is stable, i.e., LLM predictions should be consistent given
minor differences in the input. This largely falls under the study of certifiably
robust LLMs, i.e., all predictions of the LLM are certified to be correct within a
local region around the input.
local region around the input. Randomized smoothing has demonstrated great
potential in certifying the robustness and prediction stability of LLMs.
However, randomized smoothing requires adding noise to the input before model
prediction, and its certification performance depends largely on the model's
performance on corrupted data. As a result, its direct application to LLMs
remains challenging and often results in a small certification radius. To
address this issue, we take advantage of the multitasking nature of LLMs and
propose to denoise the corrupted inputs with LLMs in a self-denoising manner.
Different from previous works such as denoised smoothing, which require training
a separate model to robustify the LLM, our method enjoys far better efficiency and
flexibility. Our experimental results show that our method outperforms the
existing certification methods in terms of both certified and empirical
robustness. The code is available at
https://github.com/UCSB-NLP-Chang/SelfDenoise.
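To make the pipeline concrete, below is a minimal sketch of the mask, denoise, and vote loop the abstract describes, under the assumption that the smoothing noise is random word masking. The `llm_denoise` and `llm_classify` hooks, the prompts behind them, the mask rate, and the sample count are illustrative placeholders rather than the paper's actual settings, and the certified-radius computation from the paper is omitted.

```python
import random
from collections import Counter

MASK_TOKEN = "<mask>"


def llm_denoise(masked_text: str) -> str:
    """Hypothetical hook: prompt an LLM to replace each mask token with a fluent word."""
    raise NotImplementedError("plug in your LLM call here")


def llm_classify(text: str) -> str:
    """Hypothetical hook: prompt an LLM to return a task label for the input text."""
    raise NotImplementedError("plug in your LLM call here")


def mask_words(text: str, mask_rate: float, rng: random.Random) -> str:
    """Smoothing noise: randomly replace a fraction of words with the mask token."""
    words = text.split()
    if not words:
        return text
    n_mask = max(1, int(mask_rate * len(words)))
    for i in rng.sample(range(len(words)), n_mask):
        words[i] = MASK_TOKEN
    return " ".join(words)


def smoothed_predict(text: str, n_samples: int = 100, mask_rate: float = 0.3, seed: int = 0):
    """Mask -> self-denoise with the LLM -> classify -> majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        corrupted = mask_words(text, mask_rate, rng)
        denoised = llm_denoise(corrupted)      # the self-denoising step
        votes[llm_classify(denoised)] += 1
    label, count = votes.most_common(1)[0]
    # The paper turns these vote statistics into a certified radius; this sketch
    # only returns the empirical top-class frequency as a sanity check.
    return label, count / n_samples
```

In this sketch the denoising step is what distinguishes self-denoised smoothing from plain randomized smoothing: the classifier votes on LLM-reconstructed inputs instead of directly on heavily masked text.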
Related papers
- Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs [10.494477811252034]
Fine-tuning large language models can lead to *fine-tuning multiplicity*, where equally well-performing models make conflicting predictions on the same inputs.
This raises critical concerns about the robustness and reliability of Tabular LLMs.
This work proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining.
arXiv Detail & Related papers (2024-07-04T22:22:09Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z)
- Evaluation and Improvement of Fault Detection for Large Language Models [30.760472387136954]
This paper investigates the effectiveness of existing fault detection methods for large language models (LLMs).
We propose MuCS, a prompt Mutation-based prediction Confidence Smoothing framework, to boost the fault detection capability of existing methods.
arXiv Detail & Related papers (2024-04-14T07:06:12Z)
- Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs).
We suggest investigating internal activations and quantifying an LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z)
- Purifying Large Language Models by Ensembling a Small Language Model [39.57304668057076]
We propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data.
We empirically confirm the efficacy of ensembling LLMs with benign and small language models (SLMs).
arXiv Detail & Related papers (2024-02-19T14:00:39Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models [37.63939774027709]
Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities.
We propose and compare several confidence/uncertainty measures, applying them to *selective NLG* where unreliable results could either be ignored or yielded for further assessment.
Results reveal that a simple measure of semantic dispersion can be a reliable predictor of the quality of LLM responses.
arXiv Detail & Related papers (2023-05-30T16:31:26Z)
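The entry directly above frames disagreement among sampled responses as an uncertainty signal for selective generation. Below is a rough sketch of that idea under simplifying assumptions: a token-level Jaccard distance stands in for the semantic similarity measures the authors actually study, and `sample_fn`, `k`, and the abstention threshold are illustrative placeholders.

```python
from itertools import combinations
from typing import Callable, List, Optional, Tuple


def jaccard_distance(a: str, b: str) -> float:
    """1 - Jaccard similarity of the two responses' token sets (a crude stand-in)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def dispersion(responses: List[str]) -> float:
    """Mean pairwise distance over sampled responses; higher means less consistent."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def selective_generate(sample_fn: Callable[[], str], k: int = 5,
                       threshold: float = 0.5) -> Tuple[Optional[str], float]:
    """Selective NLG: sample k answers and abstain (return None) if they disagree too much."""
    responses = [sample_fn() for _ in range(k)]
    d = dispersion(responses)
    return (responses[0] if d <= threshold else None), d
```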