Improving Nonlinear Projection Heads using Pretrained Autoencoder Embeddings
- URL: http://arxiv.org/abs/2408.14514v1
- Date: Sun, 25 Aug 2024 11:10:33 GMT
- Title: Improving Nonlinear Projection Heads using Pretrained Autoencoder Embeddings
- Authors: Andreas Schliebitz, Heiko Tapken, Martin Atzmueller
- Abstract summary: Using a pretrained autoencoder embedding in the projector can increase classification accuracy by up to 2.9%, or 1.7% on average.
Our results also suggest that using the sigmoid and tanh activation functions within the projector can outperform ReLU in terms of peak and average classification accuracy.
- Score: 0.10241134756773229
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This empirical study aims to improve the effectiveness of the standard 2-layer MLP projection head $g(\cdot)$ featured in the SimCLR framework through the use of pretrained autoencoder embeddings. Given a contrastive learning task with a largely unlabeled image classification dataset, we first train a shallow autoencoder architecture and extract its compressed representations contained in the encoder's embedding layer. After freezing the weights within this pretrained layer, we use it as a drop-in replacement for the input layer of SimCLR's default projector. Additionally, we apply further architectural changes to the projector by decreasing its width and changing its activation function. The different projection heads are then used to contrastively train and evaluate a feature extractor $f(\cdot)$ following the SimCLR protocol, while also examining the performance impact of Z-score normalized datasets. Our experiments indicate that using a pretrained autoencoder embedding in the projector can not only increase classification accuracy by up to 2.9%, or 1.7% on average, but can also significantly decrease the dimensionality of the projection space. Our results also suggest that using the sigmoid and tanh activation functions within the projector can outperform ReLU in terms of peak and average classification accuracy. With our presented projectors, omitting Z-score normalization of the datasets often increases peak performance. In contrast, the default projection head can benefit more from normalization. All experiments involving our pretrained projectors are conducted with frozen embeddings, since our test results indicate an advantage over their non-frozen counterparts.
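The projector described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, shows the forward pass only, and assumes the projector follows the usual layout of a linear layer, an activation, and a second linear layer. Every name and dimension below (`make_projector`, `project`, `trainable_layers`, the toy sizes) is hypothetical, and randomly initialised weights stand in for an embedding layer that would really come from a trained autoencoder.

```python
import math
import random

# Activation functions compared in the abstract.
ACTIVATIONS = {
    "relu": lambda v: max(0.0, v),
    "sigmoid": lambda v: 1.0 / (1.0 + math.exp(-v)),
    "tanh": math.tanh,
}


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    """Randomly initialise a dense layer (a stand-in for real training)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return {"W": weights, "b": [0.0] * n_out, "frozen": False}


def make_projector(embedding_layer, out_dim, rng, activation="sigmoid"):
    """Build a 2-layer projector whose input layer is a frozen, pretrained embedding."""
    # Reuse the pretrained weights as-is and mark them as not trainable.
    frozen = {"W": embedding_layer["W"], "b": embedding_layer["b"], "frozen": True}
    hidden_dim = len(frozen["W"])
    output = init_layer(out_dim, hidden_dim, rng)
    return {"layers": [frozen, output], "act": ACTIVATIONS[activation]}


def project(projector, h):
    """Map a feature vector h = f(x) to its projection z = g(h)."""
    first, second = projector["layers"]
    hidden = [projector["act"](v) for v in linear(first["W"], first["b"], h)]
    return linear(second["W"], second["b"], hidden)


def trainable_layers(projector):
    """Indices of the layers an optimizer would be allowed to update."""
    return [i for i, layer in enumerate(projector["layers"]) if not layer["frozen"]]


rng = random.Random(0)
feature_dim, embedding_dim, projection_dim = 8, 4, 2  # toy sizes, assumed

# Assumption: in practice this layer is taken from a pretrained autoencoder.
pretrained_embedding = init_layer(embedding_dim, feature_dim, rng)

projector = make_projector(pretrained_embedding, projection_dim, rng, activation="tanh")
z = project(projector, [rng.random() for _ in range(feature_dim)])
```

In this sketch only the second layer is reported as trainable, which mirrors the frozen-embedding setting the abstract describes; swapping the `activation` argument is how the ReLU, sigmoid, and tanh variants would be compared.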
Related papers
- Projection Head is Secretly an Information Bottleneck [33.755883011145755]
We develop an in-depth theoretical understanding of the projection head from the information-theoretic perspective.
By establishing the theoretical guarantees on the downstream performance of the features before the projector, we reveal that an effective projector should act as an information bottleneck.
Our methods exhibit consistent improvement in the downstream performance across various real-world datasets.
arXiv Detail & Related papers (2025-03-01T14:23:31Z) - SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z) - Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
arXiv Detail & Related papers (2024-06-09T05:57:40Z) - Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
With efficiency being a high priority for scaling such models, we observed that the state-of-the-art method Mask2Former uses 50% of its compute only on the transformer encoder.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
arXiv Detail & Related papers (2024-04-23T01:34:20Z) - Investigating the Benefits of Projection Head for Representation Learning [11.20245728716827]
An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations.
The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better?
We show that implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers.
arXiv Detail & Related papers (2024-03-18T00:48:58Z) - Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors [117.61449210940955]
We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects.
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames.
arXiv Detail & Related papers (2023-06-21T06:18:05Z) - Unraveling Projection Heads in Contrastive Learning: Insights from Expansion and Shrinkage [9.540723320001621]
We aim to demystify the observed phenomenon where representations learned before projectors outperform those learned after.
We identify two crucial effects -- expansion and shrinkage -- induced by the contrastive loss on the projectors.
We propose a family of linear projectors to accurately model the projector's behavior.
arXiv Detail & Related papers (2023-06-06T01:13:18Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - ViT-Calibrator: Decision Stream Calibration for Vision Transformer [49.60474757318486]
We propose a new paradigm dubbed Decision Stream that boosts the performance of general Vision Transformers.
We shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
arXiv Detail & Related papers (2023-04-10T02:40:24Z) - Understanding the Role of the Projector in Knowledge Distillation [22.698845243751293]
We revisit the efficacy of knowledge distillation as a function matching and metric learning problem.
We verify three important design decisions, namely the normalisation, soft maximum function, and projection layers.
We attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet.
arXiv Detail & Related papers (2023-03-20T13:33:31Z) - Learnt Deep Hyperparameter selection in Adversarial Training for compressed video enhancement with perceptual critic [0.0]
Deep Feature Quality Metrics (DFQMs) have been shown to better correlate with subjective perceptual scores over traditional metrics.
We present a new method for selecting perceptually relevant layers from such a network, based on a neuroscience interpretation of layer behaviour.
Our results show that the introduction of these selected features into the critic yields up to 10% (FID) and 15% (KID) performance increase.
arXiv Detail & Related papers (2023-02-28T12:10:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.