Surfacing Biases in Large Language Models using Contrastive Input Decoding
- URL: http://arxiv.org/abs/2305.07378v1
- Date: Fri, 12 May 2023 11:09:49 GMT
- Title: Surfacing Biases in Large Language Models using Contrastive Input Decoding
- Authors: Gal Yona, Or Honovich, Itay Laish, Roee Aharoni
- Abstract summary: Contrastive Input Decoding (CID) is a decoding algorithm to generate text given two inputs.
We use CID to highlight context-specific biases that are hard to detect with standard decoding strategies.
- Score: 12.694066526722203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring that large language models (LMs) are fair, robust and useful
requires an understanding of how different modifications to their inputs impact
the model's behaviour. In the context of open-text generation tasks, however,
such an evaluation is not trivial. For example, when presenting a model with
an input text and a perturbed, "contrastive" version of it, meaningful
differences in the next-token predictions may not be revealed with standard
decoding strategies. With this motivation in mind, we propose Contrastive Input
Decoding (CID): a decoding algorithm to generate text given two inputs, where
the generated text is likely given one input but unlikely given the other. In
this way, the contrastive generations can surface potentially subtle ways in
which the LM output differs for the two inputs in a simple and
interpretable manner. We use CID to highlight context-specific biases that are
hard to detect with standard decoding strategies and quantify the effect of
different input perturbations.
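To make the idea concrete, the sketch below implements a minimal greedy variant of contrastive input decoding on top of a Hugging Face causal LM. The log-probability-difference score, the contrast weight `lam`, the choice of `gpt2`, and the example prompts are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of Contrastive Input Decoding (CID), assuming a Hugging Face
# causal LM. The scoring rule (log-prob difference), the contrast weight `lam`,
# the model name, and the example prompts are illustrative assumptions; the
# paper's exact formulation may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def contrastive_input_decode(input_a: str, input_b: str,
                             max_new_tokens: int = 20, lam: float = 1.0) -> str:
    """Greedily pick tokens that are likely given input_a but unlikely given input_b."""
    ids_a = tok(input_a, return_tensors="pt").input_ids
    ids_b = tok(input_b, return_tensors="pt").input_ids
    generated = []  # continuation token ids, appended to both contexts
    for _ in range(max_new_tokens):
        suffix = torch.tensor([generated], dtype=torch.long) if generated else None
        ctx_a = ids_a if suffix is None else torch.cat([ids_a, suffix], dim=-1)
        ctx_b = ids_b if suffix is None else torch.cat([ids_b, suffix], dim=-1)
        logp_a = torch.log_softmax(model(ctx_a).logits[0, -1], dim=-1)
        logp_b = torch.log_softmax(model(ctx_b).logits[0, -1], dim=-1)
        # Contrastive score: reward tokens probable under input_a, penalize
        # tokens that are also probable under input_b.
        next_id = int((logp_a - lam * logp_b).argmax())
        generated.append(next_id)
        if next_id == tok.eos_token_id:
            break
    return tok.decode(generated, skip_special_tokens=True)

# Example: contrast an input against a minimally perturbed version of it.
print(contrastive_input_decode("The nurse said that he", "The nurse said that she"))
```

Raising `lam` pushes the continuation further away from text that is also likely under the second input, typically at some cost in fluency.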
Related papers
- Vulnerability of LLMs to Vertically Aligned Text Manipulations [108.6908427615402]
Large language models (LLMs) have become highly effective at performing text classification tasks.
Modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks.
Do decoder-based LLMs exhibit similar vulnerabilities to vertically formatted text input?
arXiv Detail & Related papers (2024-10-26T00:16:08Z)
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models [0.0]
This paper introduces an efficient and robust method for discovering interpretable circuits in large language models.
We propose training sparse autoencoders on carefully designed positive and negative examples.
Our findings highlight the promise of discrete sparse autoencoders for scalable and efficient mechanistic interpretability.
arXiv Detail & Related papers (2024-05-21T06:26:10Z)
- Critic-Driven Decoding for Mitigating Hallucinations in Data-to-text Generation [5.304395026626743]
Hallucination of text ungrounded in the input is a well-known problem in neural data-to-text generation.
We propose a new way to mitigate hallucinations by combining the probabilistic output of a generator language model with the output of a special "text critic".
Our method does not need any changes to the underlying LM's architecture or training procedure.
arXiv Detail & Related papers (2023-10-25T20:05:07Z)
- Contrastive Decoding Improves Reasoning in Large Language Models [55.16503283583076]
We show that Contrastive Decoding achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks.
We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark.
arXiv Detail & Related papers (2023-09-17T00:29:32Z)
- Code Difference Guided Adversarial Example Generation for Deep Code Models [25.01072108219646]
Adversarial examples are important to test and enhance the robustness of deep code models.
We propose a novel adversarial example generation technique (i.e., CODA) for testing deep code models.
arXiv Detail & Related papers (2023-01-06T08:03:56Z)
- Contrastive Decoding: Open-ended Text Generation as Optimization [153.35961722855686]
We propose contrastive decoding (CD), a reliable decoding approach.
It is inspired by the fact that the failures of larger LMs are even more prevalent in smaller LMs.
CD requires zero additional training and produces higher quality text than decoding from the larger LM alone. (A minimal sketch of this expert/amateur contrast is given after this list.)
arXiv Detail & Related papers (2022-10-27T00:58:21Z)
- FAST: Improving Controllability for Text Generation with Feedback Aware Self-Training [25.75982440355576]
Controllable text generation systems often leverage control codes to direct various properties of the output like style and length.
Inspired by recent work on causal inference for NLP, this paper reveals a previously overlooked flaw in these control code-based conditional text generation algorithms.
We propose two simple techniques to reduce these correlations in training sets.
arXiv Detail & Related papers (2022-10-06T19:00:51Z)
- On Measuring Social Biases in Prompt-Based Multi-Task Learning [1.3270286124913757]
We study T0, a large-scale multi-task text-to-text language model trained using prompt-based learning.
We consider two different forms of semantically equivalent inputs: question-answer format and premise-hypothesis format.
arXiv Detail & Related papers (2022-05-23T20:01:20Z)
- On Decoding Strategies for Neural Text Generators [73.48162198041884]
We study the interaction between language generation tasks and decoding strategies.
We measure changes in attributes of generated text as a function of both decoding strategy and task.
Our results reveal both previously-observed and surprising findings.
arXiv Detail & Related papers (2022-03-29T16:25:30Z)
- Contextualized Perturbation for Textual Adversarial Attack [56.370304308573274]
Adversarial examples expose the vulnerabilities of natural language processing (NLP) models.
This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs.
arXiv Detail & Related papers (2020-09-16T06:53:15Z)
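For comparison with CID, which contrasts two inputs under one model, the sketch below illustrates the expert/amateur contrast used by Contrastive Decoding (see the entry above), assuming two Hugging Face causal LMs that share a tokenizer. The plausibility cutoff `alpha`, the greedy token-level search, and the gpt2/gpt2-large pairing are simplifying assumptions, not the paper's exact procedure.

```python
# Minimal sketch of expert/amateur contrastive decoding, assuming two Hugging
# Face causal LMs with a shared tokenizer. The cutoff `alpha`, the greedy
# token-level search, and the gpt2/gpt2-large pairing are simplifying
# assumptions, not the paper's exact procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # shared vocabulary
expert = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()
amateur = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def contrastive_decode(prompt: str, max_new_tokens: int = 30, alpha: float = 0.1) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logp_exp = torch.log_softmax(expert(ids).logits[0, -1], dim=-1)
        logp_ama = torch.log_softmax(amateur(ids).logits[0, -1], dim=-1)
        # Plausibility constraint: only keep tokens to which the expert itself
        # assigns at least alpha times its maximum probability.
        cutoff = torch.log(torch.tensor(alpha)) + logp_exp.max()
        scores = logp_exp - logp_ama  # prefer expert-likely, amateur-unlikely tokens
        scores[logp_exp < cutoff] = float("-inf")
        next_id = scores.argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
        if int(next_id) == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

print(contrastive_decode("A decoding strategy should"))
```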