Captured by Captions: On Memorization and its Mitigation in CLIP Models
- URL: http://arxiv.org/abs/2502.07830v1
- Date: Tue, 11 Feb 2025 00:11:13 GMT
- Title: Captured by Captions: On Memorization and its Mitigation in CLIP Models
- Authors: Wenhao Wang, Adam Dziedzic, Grace C. Kim, Michael Backes, Franziska Boenisch
- Abstract summary: We propose a formal definition of memorization in CLIP and use it to quantify memorization in CLIP models.
Our results indicate that CLIP's memorization behavior falls between the supervised and self-supervised paradigms, with "mis-captioned" samples exhibiting the highest levels of memorization.
We find that the text encoder contributes more to memorization than the image encoder, suggesting that mitigation strategies should focus on the text domain.
- Score: 23.005901198213966
- Abstract: Multi-modal models, such as CLIP, have demonstrated strong performance in aligning visual and textual representations, excelling in tasks like image retrieval and zero-shot classification. Despite this success, the mechanisms by which these models utilize training data, particularly the role of memorization, remain unclear. In uni-modal models, both supervised and self-supervised, memorization has been shown to be essential for generalization. However, it is not well understood how these findings would apply to CLIP, which incorporates elements of both supervised learning, via captions that provide a supervisory signal similar to labels, and self-supervised learning, via the contrastive objective. To bridge this gap in understanding, we propose a formal definition of memorization in CLIP (CLIPMem) and use it to quantify memorization in CLIP models. Our results indicate that CLIP's memorization behavior falls between the supervised and self-supervised paradigms, with "mis-captioned" samples exhibiting the highest levels of memorization. Additionally, we find that the text encoder contributes more to memorization than the image encoder, suggesting that mitigation strategies should focus on the text domain. Building on these insights, we propose multiple strategies to reduce memorization while at the same time improving utility, something that had not been shown before for traditional learning paradigms, where reducing memorization typically results in a utility decrease.
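The abstract does not spell out the CLIPMem definition itself. As a rough illustration of how a leave-one-out memorization score for an image-caption pair could be computed, consider the sketch below. It is a minimal, hypothetical example: the cosine-similarity alignment measure, the encode_image/encode_text interface, and the averaging over models trained with and without the pair are assumptions in the spirit of prior memorization definitions, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def clip_alignment(model, image, caption_tokens):
    """Cosine similarity between one CLIP-style model's image and caption embeddings
    (assumes an encode_image/encode_text interface as in common CLIP implementations)."""
    with torch.no_grad():
        img_emb = F.normalize(model.encode_image(image), dim=-1)
        txt_emb = F.normalize(model.encode_text(caption_tokens), dim=-1)
    return (img_emb * txt_emb).sum(dim=-1)

def memorization_score(models_with, models_without, image, caption_tokens):
    """Leave-one-out style memorization score for a single image-caption pair:
    average alignment under models whose training data contained the pair minus
    average alignment under models trained without it. Higher values suggest
    stronger memorization. (Hypothetical sketch; CLIPMem may differ in detail.)"""
    align_with = torch.stack(
        [clip_alignment(m, image, caption_tokens) for m in models_with]).mean()
    align_without = torch.stack(
        [clip_alignment(m, image, caption_tokens) for m in models_without]).mean()
    return (align_with - align_without).item()
```
Under a score of this kind, mis-captioned pairs would be expected to score highest, since aligning an image with an unrelated caption essentially requires memorizing that particular pair, which matches the behavior reported in the abstract.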
Related papers
- Analyzing Memorization in Large Language Models through the Lens of Model Attribution [11.295483963637217]
Large Language Models (LLMs) are prevalent in modern applications but often memorize training data, leading to privacy breaches and copyright issues.
We investigate memorization from an architectural lens by analyzing how attention modules at different layers impact memorization and generalization.
arXiv Detail & Related papers (2025-01-09T09:00:32Z)
- Detecting Memorization in Large Language Models [0.0]
Large language models (LLMs) have achieved impressive results in natural language processing but are prone to memorizing portions of their training data.
Traditional methods for detecting memorization rely on output probabilities or loss functions.
We introduce an analytical method that precisely detects memorization by examining neuron activations within the LLM.
arXiv Detail & Related papers (2024-12-02T00:17:43Z)
- Demystifying Verbatim Memorization in Large Language Models [67.49068128909349]
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications.
We develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences.
We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to memorize verbatim sequences, even for out-of-distribution sequences.
arXiv Detail & Related papers (2024-07-25T07:10:31Z)
- Memorization in Self-Supervised Learning Improves Downstream Generalization [49.42010047574022]
Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data.
We propose SSLMem, a framework for defining memorization within SSL.
arXiv Detail & Related papers (2024-01-19T11:32:47Z)
- Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification [13.090873217313732]
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID).
We first analyze the role of prompt learning in CLIP-ReID and identify its limitations.
Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning (a generic sketch of such a loss appears after this list).
arXiv Detail & Related papers (2023-10-26T08:12:53Z)
- Exploring Memorization in Fine-tuned Language Models [53.52403444655213]
We conduct the first comprehensive analysis to explore language models' memorization during fine-tuning across tasks.
Our studies with open-sourced and our own fine-tuned LMs across various tasks indicate that memorization presents a strong disparity among different fine-tuning tasks.
We provide an intuitive explanation of this task disparity via sparse coding theory and unveil a strong correlation between memorization and attention score distribution.
arXiv Detail & Related papers (2023-10-10T15:41:26Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer fusion transformer on top of a frozen CLIP (a minimal sketch of such a fusion layer appears after this list).
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Mitigating Approximate Memorization in Language Models via Dissimilarity Learned Policy [0.0]
Large Language Models (LLMs) are trained on large amounts of data.
LLMs have been shown to memorize parts of their training data and to emit them verbatim when an adversary prompts them appropriately.
arXiv Detail & Related papers (2023-05-02T15:53:28Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- Memory-augmented Dense Predictive Coding for Video Representation Learning [103.69904379356413]
We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task.
We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both.
In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z)
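For the prototypical contrastive learning (PCL) fine-tuning entry above, a generic form of such a loss is a softmax cross-entropy over scaled similarities between each embedding and per-class prototypes. The sketch below illustrates that general idea only; the function name, the temperature value, and the use of class centroids as prototypes are assumptions, not the cited paper's exact objective.
```python
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(features, labels, prototypes, temperature=0.07):
    """Generic prototypical contrastive loss (illustrative sketch).

    features:   (B, D) embeddings from the image encoder
    labels:     (B,)   identity/class index of each sample
    prototypes: (C, D) one prototype (e.g., a running centroid) per class

    Pulls each embedding toward its own class prototype and pushes it away from
    all other prototypes via a softmax over temperature-scaled cosine similarities.
    """
    features = F.normalize(features, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = features @ prototypes.t() / temperature  # (B, C) similarity logits
    return F.cross_entropy(logits, labels)

# Example: 8 samples with 128-dim embeddings and 10 identities
loss = prototypical_contrastive_loss(
    torch.randn(8, 128), torch.randint(0, 10, (8,)), torch.randn(10, 128))
```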
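For the retrieval-enhanced entry above, the described design, a light-weight single-layer fusion transformer that refines a frozen CLIP embedding with cross-modal retrieved embeddings, could be sketched as follows. The module name, dimensions, and the use of nn.TransformerEncoderLayer are assumptions for illustration; this is not the RECO implementation.
```python
import torch
import torch.nn as nn

class RetrievalFusion(nn.Module):
    """Single-layer transformer that refines a query embedding with retrieved
    cross-modal embeddings (hypothetical sketch of a retrieval-enhanced head)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.fusion = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, query_emb, retrieved_embs):
        # query_emb:      (B, D)    e.g. a frozen CLIP image embedding
        # retrieved_embs: (B, K, D) K embeddings retrieved from the other modality
        tokens = torch.cat([query_emb.unsqueeze(1), retrieved_embs], dim=1)  # (B, 1+K, D)
        fused = self.fusion(tokens)
        return fused[:, 0]  # refined query embedding

# Example: refine 4 CLIP image embeddings with 16 retrieved text embeddings each
refined = RetrievalFusion()(torch.randn(4, 512), torch.randn(4, 16, 512))
```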