Related papers: Memorization in Self-Supervised Learning Improves Downstream Generalization

Memorization in Self-Supervised Learning Improves Downstream Generalization

URL: http://arxiv.org/abs/2401.12233v3
Date: Tue, 18 Jun 2024 14:49:32 GMT
Title: Memorization in Self-Supervised Learning Improves Downstream Generalization
Authors: Wenhao Wang, Muhammad Ahmad Kaleem, Adam Dziedzic, Michael Backes, Nicolas Papernot, Franziska Boenisch,
Abstract summary: Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data. We propose SSLMem, a framework for defining memorization within SSL.
Score: 49.42010047574022
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data-often scraped from the internet. This data can still be sensitive and empirical evidence suggests that SSL encoders memorize private information of their training data and can disclose them at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the difference in alignment of representations for data points and their augmented views returned by both encoders that were trained on these data points and encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets we highlight that even though SSL relies on large datasets and strong augmentations-both known in supervised learning as regularization techniques that reduce overfitting-still significant fractions of training data points experience high memorization. Through our empirical results, we show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.

Related papers

Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning [50.45435841411193]
Code Language Models (CLMs) exhibit unintended memorization of sensitive training data, enabling verbatim reproduction of confidential information when specifically prompted.<n>We introduce CodeEraser, an advanced variant that selectively unlearns sensitive memorized segments in code while preserving the structural integrity and functional correctness of the surrounding code.
arXiv Detail & Related papers (2025-09-17T07:12:35Z)
Co-Learning: Towards Semi-Supervised Object Detection with Road-side Cameras [1.5495593104596401]
Semi-supervised learning (SSL) can train object detectors using labeled and unlabeled data. SSL faces several challenges, including pseudo-target inconsistencies, disharmony between classification and regression tasks, and efficient use of abundant unlabeled data. We develop a teacher-student-based SSL framework, Co-Learning, which employs mutual learning and annotation-alignment strategies to adeptly navigate these complexities.
arXiv Detail & Related papers (2024-11-28T13:42:55Z)
Localizing Memorization in SSL Vision Encoders [24.681788021239118]
We propose two metrics for localizing memorization in SSL encoders on a per-layer (layermem) and per-unit basis (unitmem) We find that while SSL memorization increases with layer depth, highly memorizing units are distributed across the entire encoder.
arXiv Detail & Related papers (2024-09-27T18:11:00Z)
Context-Aware Predictive Coding: A Representation Learning Framework for WiFi Sensing [0.0]
WiFi sensing is an emerging technology that utilizes wireless signals for various sensing applications. In this paper, we introduce a novel SSL framework called Context-Aware Predictive Coding (CAPC) CAPC effectively learns from unlabelled data and adapts to diverse environments. Our evaluations demonstrate that CAPC not only outperforms other SSL methods and supervised approaches, but also achieves superior generalization capabilities.
arXiv Detail & Related papers (2024-09-16T17:59:49Z)
A Survey of the Self Supervised Learning Mechanisms for Vision Transformers [5.152455218955949]
The application of self supervised learning (SSL) in vision tasks has gained significant attention. We develop a comprehensive taxonomy of systematically classifying the SSL techniques. We discuss the motivations behind SSL, review popular pre-training tasks, and highlight the challenges and advancements in this field.
arXiv Detail & Related papers (2024-08-30T07:38:28Z)
Reverse Engineering Self-Supervised Learning [17.720366509919167]
Self-supervised learning (SSL) is a powerful tool in machine learning. This paper presents an in-depth empirical analysis of SSL-trained representations.
arXiv Detail & Related papers (2023-05-24T23:15:28Z)
Does Decentralized Learning with Non-IID Unlabeled Data Benefit from Self Supervision? [51.00034621304361]
We study decentralized learning with unlabeled data through the lens of self-supervised learning (SSL) We study the effectiveness of contrastive learning algorithms under decentralized learning settings.
arXiv Detail & Related papers (2022-10-20T01:32:41Z)
Toward a Geometrical Understanding of Self-supervised Contrastive Learning [55.83778629498769]
Self-supervised learning (SSL) is one of the premier techniques to create data representations that are actionable for transfer learning in the absence of human annotations. Mainstream SSL techniques rely on a specific deep neural network architecture with two cascaded neural networks: the encoder and the projector. In this paper, we investigate how the strength of the data augmentation policies affects the data embedding.
arXiv Detail & Related papers (2022-05-13T23:24:48Z)
Federated Cycling (FedCy): Semi-supervised Federated Learning of Surgical Phases [57.90226879210227]
FedCy is a semi-supervised learning (FSSL) method that combines FL and self-supervised learning to exploit a decentralized dataset of both labeled and unlabeled videos. We demonstrate significant performance gains over state-of-the-art FSSL methods on the task of automatic recognition of surgical phases.
arXiv Detail & Related papers (2022-03-14T17:44:53Z)
Self-supervised Learning is More Robust to Dataset Imbalance [65.84339596595383]
We investigate self-supervised learning under dataset imbalance. Off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations. We devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets.
arXiv Detail & Related papers (2021-10-11T06:29:56Z)
ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for Semi-supervised Continual Learning [52.831894583501395]
Continual learning assumes the incoming data are fully labeled, which might not be applicable in real applications. We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN) We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.