Related papers: Manipulating and Mitigating Generative Model Biases without Retraining

Manipulating and Mitigating Generative Model Biases without Retraining

URL: http://arxiv.org/abs/2404.02530v2
Date: Tue, 17 Sep 2024 01:07:58 GMT
Title: Manipulating and Mitigating Generative Model Biases without Retraining
Authors: Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian,
Abstract summary: We propose a dynamic and computationally efficient manipulation of T2I model biases by exploiting their rich language embedding spaces without model retraining. We show that leveraging foundational vector algebra allows for a convenient control over language model embeddings to shift T2I model outputs. As a by-product, this control serves as a form of precise prompt engineering to generate images which are generally implausible using regular text prompts.
Score: 49.60774626839712
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image (T2I) generative models have gained increased popularity in the public domain. While boasting impressive user-guided generative abilities, their black-box nature exposes users to intentionally- and intrinsically-biased outputs. Bias manipulation (and mitigation) techniques typically rely on careful tuning of learning parameters and training data to adjust decision boundaries to influence model bias characteristics, which is often computationally demanding. We propose a dynamic and computationally efficient manipulation of T2I model biases by exploiting their rich language embedding spaces without model retraining. We show that leveraging foundational vector algebra allows for a convenient control over language model embeddings to shift T2I model outputs and control the distribution of generated classes. As a by-product, this control serves as a form of precise prompt engineering to generate images which are generally implausible using regular text prompts. We demonstrate a constructive application of our technique by balancing the frequency of social classes in generated images, effectively balancing class distributions across three social bias dimensions. We also highlight a negative implication of bias manipulation by framing our method as a backdoor attack with severity control using semantically-null input triggers, reporting up to 100% attack success rate. Key-words: Text-to-Image Models, Generative Models, Bias, Prompt Engineering, Backdoor Attacks

Related papers

AutoDebias: Automated Framework for Debiasing Text-to-Image Models [6.581606189725493]
Text-to-Image (T2I) models generate high-quality images from text prompts but often exhibit unintended social biases.<n>We propose AutoDebias, a framework that automatically identifies and mitigates harmful biases in T2I models without prior knowledge of specific bias types.<n>We evaluate the framework on a benchmark covering over 25 bias scenarios, including challenging cases where multiple biases occur simultaneously.
arXiv Detail & Related papers (2025-08-01T09:05:45Z)
Implicit Bias Injection Attacks against Text-to-Image Diffusion Models [17.131167390657243]
biased T2I models can generate content with specific tendencies, potentially influencing people's perceptions. This paper introduces a novel form of implicit bias that lacks explicit visual features but can manifest in diverse ways. We propose an implicit bias injection attack framework (IBI-Attacks) against T2I diffusion models.
arXiv Detail & Related papers (2025-04-02T15:24:12Z)
Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement [3.0820287240219795]
We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. Our approach leverages a curriculum learning framework combined with a fine-grained adversarial loss to fine-tune the model using adversarial examples. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods.
arXiv Detail & Related papers (2024-04-18T00:41:32Z)
Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping. We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models [54.19289900203071]
The rise in popularity of text-to-image generative artificial intelligence has attracted widespread public interest. We demonstrate that this technology can be attacked to generate content that subtly manipulates its users. We propose a Backdoor Attack on text-to-image Generative Models (BAGM) Our attack is the first to target three popular text-to-image generative models across three stages of the generative process.
arXiv Detail & Related papers (2023-07-31T08:34:24Z)
Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness [15.059419033330126]
We present a novel strategy, called Fair Diffusion, to attenuate biases after the deployment of generative text-to-image models. Specifically, we demonstrate shifting a bias, based on human instructions, in any direction yielding arbitrarily new proportions for, e.g., identity groups. This introduced control enables instructing generative image models on fairness, with no data filtering and additional training required.
arXiv Detail & Related papers (2023-02-07T18:25:28Z)
Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding. We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
Better sampling in explanation methods can prevent dieselgate-like deception [0.0]
Interpretability of prediction models is necessary to determine their biases and causes of errors. Popular techniques, such as IME, LIME, and SHAP, use perturbation of instance features to explain individual predictions. We show that the improved sampling increases the robustness of the LIME and SHAP, while previously untested method IME is already the most robust of all.
arXiv Detail & Related papers (2021-01-26T13:41:37Z)
Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended task. Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available. We show a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
Backdoor Attacks against Transfer Learning with Pre-trained Deep Learning Models [23.48763375455514]
Transfer learning provides an effective solution for feasibly and fast customize accurate textitStudent models. Many pre-trained Teacher models are publicly available and maintained by public platforms, increasing their vulnerability to backdoor attacks. We demonstrate a backdoor threat to transfer learning tasks on both image and time-series data leveraging the knowledge of publicly accessible Teacher models.
arXiv Detail & Related papers (2020-01-10T01:31:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.