Related papers: Uncovering Bias in Large Vision-Language Models with Counterfactuals

Uncovering Bias in Large Vision-Language Models with Counterfactuals

URL: http://arxiv.org/abs/2404.00166v2
Date: Fri, 7 Jun 2024 23:29:19 GMT
Title: Uncovering Bias in Large Vision-Language Models with Counterfactuals
Authors: Phillip Howard, Anahita Bhiwandiwalla, Kathleen C. Fraser, Svetlana Kiritchenko,
Abstract summary: We study the social biases contained in text generated by Large Vision-Language Models (LVLMs) We present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets. We find that social attributes such as race, gender, and physical characteristics depicted in input images can significantly influence toxicity and the generation of competency-associated words.
Score: 8.414108895243148
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined the social biases contained in text generated by LLMs, this topic has been relatively unexplored in LVLMs. Examining social biases in LVLMs is particularly challenging due to the confounding contributions of bias induced by information contained across the text and visual modalities. To address this challenging problem, we conduct a large-scale study of text generated by different LVLMs under counterfactual changes to input images. Specifically, we present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets, where each set contains images which are largely identical in their depiction of a common subject (e.g., a doctor), but vary only in terms of intersectional social attributes (e.g., race and gender). We comprehensively evaluate the text produced by different LVLMs under this counterfactual generation setting and find that social attributes such as race, gender, and physical characteristics depicted in input images can significantly influence toxicity and the generation of competency-associated words.

Related papers

TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images. We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z)
A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs [21.08821957575833]
We create a novel dataset of 40 diverse personas varying in age, gender, occupation, and location. This consists of four modalities to equivalently represent a persona: image-only, text-only, a combination of image and small text, and typographical images. Comprehensive experiments show that personas represented by detailed text show more linguistic habits, while typographical images often show more consistency with the persona.
arXiv Detail & Related papers (2025-02-27T20:25:00Z)
VilBias: A Study of Bias Detection through Linguistic and Visual Cues , presenting Annotation Strategies, Evaluation, and Key Challenges [2.2751168722976587]
VLBias is a framework that leverages state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect linguistic and visual biases in news content. We present a multimodal dataset comprising textual content and corresponding images from diverse news sources.
arXiv Detail & Related papers (2024-12-22T15:05:30Z)
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity [68.15983300711355]
FineCAPTION is a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different levels. We introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning.
arXiv Detail & Related papers (2024-11-23T02:20:32Z)
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. Our model is capable of processing images with resolutions up to $1008times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs. This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals [8.41410889524315]
We study the social biases contained in text generated by Large Vision-Language Models (LVLMs) We present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets. We evaluate the text produced by different models under this counterfactual generation setting at scale, producing over 57 million responses from popular LVLMs.
arXiv Detail & Related papers (2024-05-30T15:27:56Z)
StoryGPT-V: Large Language Models as Consistent Story Visualizers [39.790319429455856]
generative models have demonstrated impressive capabilities in generating realistic and visually pleasing images grounded on textual prompts. Yet, the emerging Large Language Model (LLM) showcases robust reasoning abilities to navigate through ambiguous references. We introduce textbfStoryGPT-V, which leverages the merits of the latent diffusion (LDM) and LLM to produce images with consistent and high-quality characters.
arXiv Detail & Related papers (2023-12-04T18:14:29Z)
Mitigating Hallucination in Visual Language Models with Visual Supervision [33.05550629039951]
Large vision-language models (LVLMs) suffer from hallucination a lot. Key problem lies in its weak ability to comprehend detailed content in a multi-modal context. In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs.
arXiv Detail & Related papers (2023-11-27T09:30:02Z)
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image. This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling. With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling. We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.