Emergence of Segmentation with Minimalistic White-Box Transformers
- URL: http://arxiv.org/abs/2308.16271v1
- Date: Wed, 30 Aug 2023 19:02:17 GMT
- Title: Emergence of Segmentation with Minimalistic White-Box Transformers
- Authors: Yaodong Yu, Tianzhe Chu, Shengbang Tong, Ziyang Wu, Druv Pai, Sam
Buchanan, Yi Ma
- Abstract summary: Previous works have shown that segmentation properties emerge in vision transformers (ViTs) trained using self-supervised methods such as DINO, but not in those trained on supervised classification tasks.
In this study, we probe whether segmentation emerges in transformer-based models solely as a result of intricate self-supervised learning mechanisms.
Our results suggest a path to design white-box foundation models that are simultaneously highly performant and mathematically fully interpretable.
- Score: 22.688777622988795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-like models for vision tasks have recently proven effective for a
wide range of downstream applications such as segmentation and detection.
Previous works have shown that segmentation properties emerge in vision
transformers (ViTs) trained using self-supervised methods such as DINO, but not
in those trained on supervised classification tasks. In this study, we probe
whether segmentation emerges in transformer-based models solely as a result of
intricate self-supervised learning mechanisms, or if the same emergence can be
achieved under much broader conditions through proper design of the model
architecture. Through extensive experimental results, we demonstrate that when
employing a white-box transformer-like architecture known as CRATE, whose
design explicitly models and pursues low-dimensional structures in the data
distribution, segmentation properties, at both the whole and parts levels,
already emerge with a minimalistic supervised training recipe. Layer-wise
finer-grained analysis reveals that the emergent properties strongly
corroborate the designed mathematical functions of the white-box network. Our
results suggest a path to design white-box foundation models that are
simultaneously highly performant and mathematically fully interpretable. Code
is at \url{https://github.com/Ma-Lab-Berkeley/CRATE}.
Related papers
- Cliqueformer: Model-Based Optimization with Structured Transformers [102.55764949282906]
We develop a model that learns the structure of an MBO task and empirically leads to improved designs.
We evaluate Cliqueformer on various tasks, ranging from high-dimensional black-box functions to real-world tasks of chemical and genetic design.
arXiv Detail & Related papers (2024-10-17T00:35:47Z) - Investigating Self-Supervised Methods for Label-Efficient Learning [27.029542823306866]
We study different self supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling for their low-shot capabilities.
We introduce a framework involving both mask image modelling and clustering as pretext tasks, which performs better across all low-shot downstream tasks.
When testing the model on full scale datasets, we show performance gains in multi-class classification, multi-label classification and semantic segmentation.
arXiv Detail & Related papers (2024-06-25T10:56:03Z) - On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier [20.17288970927518]
We study the similarity of representations between the hidden layers of individual transformers.
We propose an aligned training approach to enhance the similarity between internal representations.
arXiv Detail & Related papers (2024-06-20T16:41:09Z) - Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient [52.2669490431145]
PropEn is inspired by'matching', which enables implicit guidance without training a discriminator.
We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution.
arXiv Detail & Related papers (2024-05-28T11:30:19Z) - Masked Completion via Structured Diffusion with White-Box Transformers [23.07048591213815]
We provide the first instantiation of the white-box design paradigm that can be applied to large-scale unsupervised representation learning.
We do this by exploiting a fundamental connection between diffusion, compression, and (masked) completion, deriving a deep transformer-like masked autoencoder architecture.
CRATE-MAE demonstrates highly promising performance on large-scale imagery datasets.
arXiv Detail & Related papers (2024-04-03T04:23:01Z) - Transformers as Statisticians: Provable In-Context Learning with
In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A emphsingle transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z) - A Simple Single-Scale Vision Transformer for Object Localization and
Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT)
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z) - Understanding Dynamics of Nonlinear Representation Learning and Its
Application [12.697842097171119]
We study the dynamics of implicit nonlinear representation learning.
We show that the data-architecture alignment condition is sufficient for the global convergence.
We derive a new training framework, which satisfies the data-architecture alignment condition without assuming it.
arXiv Detail & Related papers (2021-06-28T16:31:30Z) - Improving Label Quality by Jointly Modeling Items and Annotators [68.8204255655161]
We propose a fully Bayesian framework for learning ground truth labels from noisy annotators.
Our framework ensures scalability by factoring a generative, Bayesian soft clustering model over label distributions into the classic David and Skene joint annotator-data model.
arXiv Detail & Related papers (2021-06-20T02:15:20Z) - S2RMs: Spatially Structured Recurrent Modules [105.0377129434636]
We take a step towards exploiting dynamic structure that are capable of simultaneously exploiting both modular andtemporal structures.
We find our models to be robust to the number of available views and better capable of generalization to novel tasks without additional training.
arXiv Detail & Related papers (2020-07-13T17:44:30Z) - Unsupervised Learning Consensus Model for Dynamic Texture Videos
Segmentation [12.462608802359936]
We present an effective unsupervised learning consensus model for the segmentation of dynamic texture (ULCM)
In the proposed model, the set of values of the requantized local binary patterns (LBP) histogram around the pixel to be classified are used as features.
Experiments conducted on the challenging SynthDB dataset show that ULCM is significantly faster, easier to code, simple and has limited parameters.
arXiv Detail & Related papers (2020-06-29T16:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.