When Text and Images Don't Mix: Bias-Correcting Language-Image Similarity Scores for Anomaly Detection
- URL: http://arxiv.org/abs/2407.17083v1
- Date: Wed, 24 Jul 2024 08:20:02 GMT
- Title: When Text and Images Don't Mix: Bias-Correcting Language-Image Similarity Scores for Anomaly Detection
- Authors: Adam Goodge, Bryan Hooi, Wee Siong Ng
- Abstract summary: We show that the embeddings of text inputs unexpectedly cluster tightly together, far away from image embeddings, contrary to the model's contrastive training objective.
We propose a novel methodology called BLISS which directly accounts for this similarity bias through the use of an auxiliary, external set of text inputs.
- Score: 35.09035417676343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) achieves remarkable performance in various downstream tasks through the alignment of image and text input embeddings, and it holds great promise for anomaly detection. However, our empirical experiments show that the embeddings of text inputs unexpectedly cluster tightly together, far away from image embeddings, contrary to the model's contrastive training objective of aligning image-text input pairs. We show that this phenomenon induces a 'similarity bias', in which false negative and false positive errors occur due to bias in the similarities between images and the normal label text embeddings. To address this bias, we propose a novel methodology called BLISS which directly accounts for this similarity bias through the use of an auxiliary, external set of text inputs. BLISS is simple: it does not require strong inductive biases about anomalous behaviour nor an expensive training process, and it significantly outperforms baseline methods on benchmark image datasets, even when access to normal data is extremely limited.
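The abstract describes the key idea (correcting a bias in image-text similarity scores using an auxiliary, external set of text inputs) but not an exact scoring rule. The following is a minimal, hedged sketch of one way to read that idea with CLIP: score an image by its similarity to a normal-class prompt, then subtract its mean similarity to a set of auxiliary text prompts. The model checkpoint, the prompt lists, the helper name bliss_style_score, and the mean-subtraction correction are all illustrative assumptions, not the paper's actual BLISS formulation.

```python
import torch
import open_clip
from PIL import Image

# Load a pre-trained CLIP model (checkpoint choice is an assumption for illustration).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# One normal-class prompt plus an auxiliary, external set of unrelated text prompts.
# Both lists are hypothetical placeholders, not the paper's actual inputs.
normal_prompt = "a photo of a cat"
auxiliary_prompts = [
    "a photo of a truck",
    "a photo of a building",
    "a diagram of a network",
    "a painting of a landscape",
]

@torch.no_grad()
def bliss_style_score(image_path: str) -> float:
    """Return a bias-adjusted similarity score for one image.

    Higher scores mean the image looks more like the normal class. The
    mean-subtraction step is only an illustrative reading of 'accounting
    for similarity bias with auxiliary texts'.
    """
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer([normal_prompt] + auxiliary_prompts))

    # Cosine similarities between the image and every text prompt.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(0)

    # Because text embeddings cluster tightly, raw image-text similarities
    # share a common offset; subtracting the mean similarity to the
    # auxiliary texts removes that shared bias term.
    return (sims[0] - sims[1:].mean()).item()
```

Under this sketch, anomaly detection reduces to thresholding the returned score: images whose corrected similarity to the normal-label prompt falls below a validation-chosen threshold are flagged as anomalous.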
Related papers
- Debiasing Vision-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
- AITTI: Learning Adaptive Inclusive Token for Text-to-Image Generation [53.65701943405546]
We learn adaptive inclusive tokens to shift the attribute distribution of the final generative outputs.
Our method requires neither explicit attribute specification nor prior knowledge of the bias distribution.
Our method achieves comparable performance to models that require specific attributes or editing directions for generation.
arXiv Detail & Related papers (2024-06-18T17:22:23Z)
- Common-Sense Bias Discovery and Mitigation for Classification Tasks [16.8259488742528]
We propose a framework to extract feature clusters in a dataset based on image descriptions.
The analyzed features and correlations are human-interpretable, so we name the method Common-Sense Bias Discovery (CSBD).
Experiments show that our method discovers novel biases on multiple classification tasks for two benchmark image datasets.
arXiv Detail & Related papers (2024-01-24T03:56:07Z)
- Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback [20.78162037954646]
We introduce a decompositional approach towards evaluation and improvement of text-to-image alignment.
Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy.
arXiv Detail & Related papers (2023-07-10T17:54:57Z)
- Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets [52.77024349608834]
Vision-language models can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet.
COCO Captions is the most commonly used dataset for evaluating bias between background context and the gender of people in-situ.
We propose a novel dataset debiasing pipeline to augment the COCO dataset with synthetic, gender-balanced contrast sets.
arXiv Detail & Related papers (2023-05-24T17:59:18Z)
- Mitigating Test-Time Bias for Fair Image Retrieval [18.349154934096784]
We address the challenge of generating fair and unbiased image retrieval results given neutral textual queries.
We introduce a straightforward technique, Post-hoc Bias Mitigation, that post-processes the outputs from the pre-trained vision-language model.
Compared with various existing bias-mitigation methods, our approach achieves the lowest bias in text-based image retrieval results.
arXiv Detail & Related papers (2023-05-23T21:31:16Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- Prefix Conditioning Unifies Language and Label Supervision [84.11127588805138]
We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations.
In experiments, we show that this simple technique improves zero-shot image recognition accuracy and robustness to image-level distribution shift.
arXiv Detail & Related papers (2022-06-02T16:12:26Z)
- Data Generation using Texture Co-occurrence and Spatial Self-Similarity for Debiasing [6.976822832216875]
We propose a novel de-biasing approach that explicitly generates additional images using texture representations of oppositely labeled images.
Each newly generated image preserves the spatial information of a source image while transferring textures from a target image of the opposite label.
Our model integrates a texture co-occurrence loss that determines whether a generated image's texture is similar to that of the target, and a spatial self-similarity loss that determines whether the spatial details between the generated and source images are well preserved.
arXiv Detail & Related papers (2021-10-15T08:04:59Z)
- An Unsupervised Sampling Approach for Image-Sentence Matching Using Document-Level Structural Information [64.66785523187845]
We focus on the problem of unsupervised image-sentence matching.
Existing research explores to utilize document-level structural information to sample positive and negative instances for model training.
We propose a new sampling strategy to select additional intra-document image-sentence pairs as positive or negative samples.
arXiv Detail & Related papers (2021-03-21T05:43:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.