DILEMMA: Self-Supervised Shape and Texture Learning with Transformers
- URL: http://arxiv.org/abs/2204.04788v1
- Date: Sun, 10 Apr 2022 22:58:02 GMT
- Title: DILEMMA: Self-Supervised Shape and Texture Learning with Transformers
- Authors: Sepehr Sameni, Simon Jenni, Paolo Favaro
- Abstract summary: We propose a pseudo-task to explicitly boost both shape and texture discriminability in models trained via self-supervised learning.
We call our method DILEMMA, which stands for Detection of Incorrect Location EMbeddings with MAsked inputs.
- Score: 33.296154476701055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is a growing belief that deep neural networks with a shape bias may
exhibit better generalization capabilities than models with a texture bias,
because shape is a more reliable indicator of the object category. However, we
show experimentally that existing measures of shape bias are not stable
predictors of generalization and argue that shape discrimination should not
come at the expense of texture discrimination. Thus, we propose a pseudo-task
to explicitly boost both shape and texture discriminability in models trained
via self-supervised learning. For this purpose, we train a ViT to detect which
input token has been combined with an incorrect positional embedding. To retain
texture discrimination, the ViT is also trained as in MoCo with a
student-teacher architecture and a contrastive loss over an extra learnable
class token. We call our method DILEMMA, which stands for Detection of
Incorrect Location EMbeddings with MAsked inputs. We evaluate our method
through fine-tuning on several datasets and show that it outperforms MoCoV3 and
DINO. Moreover, we show that when downstream tasks are strongly reliant on
shape (such as in the YOGA-82 pose dataset), our pre-trained features yield a
significant gain over prior work. Code will be released upon publication.
Related papers
- Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation [2.8281887612574153]
We show that meaningful structure in the latent space does not emerge naturally.
We propose a method that aligns representations from different views of the data, capturing complementary information without inducing false positives.
Our experiments show that the proposed self-supervised learning method, Consistent View Alignment, improves performance on downstream tasks.
arXiv Detail & Related papers (2025-09-17T09:23:52Z)
- MOREL: Enhancing Adversarial Robustness through Multi-Objective Representation Learning [1.534667887016089]
Deep neural networks (DNNs) are vulnerable to slight adversarial perturbations.
We show that strong feature representation learning during training can significantly enhance the original model's robustness.
We propose MOREL, a multi-objective feature representation learning approach, encouraging classification models to produce similar features for inputs within the same class, despite perturbations.
arXiv Detail & Related papers (2024-10-02T16:05:03Z)
- Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images [76.47980643420375]
This paper builds on the hypothesis that learning semantic correspondences is inherently data-hungry.
We demonstrate that a simple machine annotator can reliably enrich paired keypoints via machine supervision.
Our models surpass current state-of-the-art models on semantic correspondence learning benchmarks like SPair-71k, PF-PASCAL, and PF-WILLOW.
arXiv Detail & Related papers (2023-11-30T13:22:15Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Uncertainty in Contrastive Learning: On the Predictability of Downstream Performance [7.411571833582691]
We study whether the uncertainty of such a representation can be quantified for a single datapoint in a meaningful way.
We show that this goal can be achieved by directly estimating the distribution of the training data in the embedding space.
arXiv Detail & Related papers (2022-07-19T15:44:59Z)
- Reasoning-Modulated Representations [85.08205744191078]
We study a common setting where our task is not purely opaque.
Our approach paves the way for a new class of data-efficient representation learning.
arXiv Detail & Related papers (2021-07-19T13:57:13Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Real-world unlabeled data is commonly imbalanced and follows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning to automatically balance the representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
- Understanding Robustness in Teacher-Student Setting: A New Perspective [42.746182547068265]
Adversarial examples are inputs to machine learning models for which a bounded adversarial perturbation can mislead the model into arbitrarily incorrect predictions.
Extensive studies try to explain the existence of adversarial examples and provide ways to improve model robustness.
Our studies could shed light on future exploration of adversarial examples and on enhancing model robustness via principled data augmentation.
arXiv Detail & Related papers (2021-02-25T20:54:24Z)
- Adjoint Rigid Transform Network: Task-conditioned Alignment of 3D Shapes [86.2129580231191]
Adjoint Rigid Transform (ART) Network is a neural module which can be integrated with a variety of 3D networks.
ART learns to rotate input shapes to a learned canonical orientation, which is crucial for many tasks.
We will release our code and pre-trained models for further research.
arXiv Detail & Related papers (2021-02-01T20:58:45Z)
- Trade-offs between membership privacy & adversarially robust learning [13.37805637358556]
We identify settings where standard models will overfit to a larger extent in comparison to robust models.
The degree of overfitting naturally depends on the amount of data available for training.
arXiv Detail & Related papers (2020-06-08T14:20:12Z)
- REST: Performance Improvement of a Black Box Model via RL-based Spatial Transformation [15.691668909002892]
We study robustness to geometric transformations in a specific condition where the black-box image classifier is given.
We propose an additional learner, REinforcement Spatial Transform (REST), that transforms warped input data into samples regarded as in-distribution by the black-box models.
arXiv Detail & Related papers (2020-02-16T16:15:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.