ViTCAE: ViT-based Class-conditioned Autoencoder
- URL: http://arxiv.org/abs/2509.16554v1
- Date: Sat, 20 Sep 2025 06:48:45 GMT
- Title: ViTCAE: ViT-based Class-conditioned Autoencoder
- Authors: Vahid Jebraeeli, Hamid Krim, Derya Cansever
- Abstract summary: Vision Transformer (ViT) based autoencoders often underutilize the global Class token and employ static attention mechanisms. This paper introduces ViTCAE, a framework that addresses these issues by re-purposing the Class token into a generative linchpin.
- Score: 8.844699137494105
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Transformer (ViT) based autoencoders often underutilize the global Class token and employ static attention mechanisms, limiting both generative control and optimization efficiency. This paper introduces ViTCAE, a framework that addresses these issues by re-purposing the Class token into a generative linchpin. In our architecture, the encoder maps the Class token to a global latent variable that dictates the prior distribution for local, patch-level latent variables, establishing a robust dependency where global semantics directly inform the synthesis of local details. Drawing inspiration from opinion dynamics, we treat each attention head as a dynamical system of interacting tokens seeking consensus. This perspective motivates a convergence-aware temperature scheduler that adaptively anneals each head's influence function based on its distributional stability. This process enables a principled head-freezing mechanism, guided by theoretically-grounded diagnostics like an attention evolution distance and a consensus/cluster functional. This technique prunes converged heads during training to significantly improve computational efficiency without sacrificing fidelity. By unifying a generative Class token with an adaptive attention mechanism rooted in multi-agent consensus theory, ViTCAE offers a more efficient and controllable approach to transformer-based generation.
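The convergence-aware temperature scheduling and head freezing described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not give formulas, so the distance used here (mean total-variation distance between successive attention maps), the thresholds, the decay rule, and all names (`HeadScheduler`, `evolution_distance`, `run_demo`) are assumptions chosen only to show the general idea of annealing a head once its attention distribution stabilizes and freezing it once it stays stable at the lowest temperature.

```python
import math


def softmax(scores, temperature):
    """Temperature-scaled softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_rows(score_matrix, temperature):
    """Apply the softmax row by row to get one head's attention map."""
    return [softmax(row, temperature) for row in score_matrix]


def evolution_distance(prev_attn, curr_attn):
    """Assumed diagnostic: mean total-variation distance between two maps."""
    per_row = [
        0.5 * sum(abs(p - c) for p, c in zip(prev_row, curr_row))
        for prev_row, curr_row in zip(prev_attn, curr_attn)
    ]
    return sum(per_row) / len(per_row)


class HeadScheduler:
    """Hypothetical per-head scheduler: anneal when stable, then freeze."""

    def __init__(self, num_heads, init_temp=1.0, min_temp=0.25,
                 decay=0.5, stable_tol=0.05, freeze_tol=0.01):
        self.temps = [init_temp] * num_heads
        self.frozen = [False] * num_heads
        self.prev = [None] * num_heads
        self.min_temp = min_temp
        self.decay = decay
        self.stable_tol = stable_tol
        self.freeze_tol = freeze_tol

    def step(self, head_scores):
        """Update every head from its current score matrix."""
        for h, scores in enumerate(head_scores):
            if self.frozen[h]:
                continue  # a frozen head would receive no further updates
            attn = attention_rows(scores, self.temps[h])
            prev, self.prev[h] = self.prev[h], attn
            if prev is None:
                continue  # nothing to compare against on the first step
            dist = evolution_distance(prev, attn)
            if dist < self.freeze_tol and self.temps[h] <= self.min_temp:
                self.frozen[h] = True
            elif dist < self.stable_tol:
                # Stable at this temperature: sharpen the head further.
                self.temps[h] = max(self.min_temp, self.temps[h] * self.decay)


def run_demo(steps=6):
    """Head 0 sees constant scores; head 1 sees scores that keep changing."""
    sched = HeadScheduler(num_heads=2)
    for t in range(steps):
        stable = [[1.0, 0.0], [0.0, 1.0]]
        moving = [[3.0 * (t % 2), 0.0], [0.0, 3.0 * ((t + 1) % 2)]]
        sched.step([stable, moving])
    return sched
```

In this toy run the head with constant scores is annealed down to the minimum temperature and then frozen, while the head whose scores keep oscillating is left untouched. In a real training loop, freezing would correspond to skipping gradient updates for that head's parameters, which is where the computational savings claimed in the abstract would come from.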
Related papers
- Smart Sensor Placement: A Correlation-Aware Attribution Framework (CAAF) for Real-world Data Modeling [11.354527723215568]
Optimal sensor placement (OSP) is critical for efficient, accurate monitoring, control, and inference in real-world systems. We propose a machine-learning-based feature attribution framework to identify OSP for the prediction of quantities of interest.
arXiv Detail & Related papers (2025-10-26T03:50:16Z)
- Attention Schema-based Attention Control (ASAC): A Cognitive-Inspired Approach for Attention Management in Transformers [6.853513140582486]
We introduce ASAC (Attention Schema-based Attention Control), which integrates the attention schema concept into artificial neural networks. By explicitly modeling attention allocation, our approach aims to enhance system efficiency. We demonstrate ASAC's effectiveness in both the vision and NLP domains, highlighting its ability to improve classification accuracy and expedite the learning process.
arXiv Detail & Related papers (2025-09-19T15:08:30Z)
- ERIS: An Energy-Guided Feature Disentanglement Framework for Out-of-Distribution Time Series Classification [51.07970070817353]
An ideal time series classification (TSC) should be able to capture invariant representations. Current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. We propose an end-to-end Energy-Regularized Information for Shift-Robustness framework to enable guided and reliable feature disentanglement.
arXiv Detail & Related papers (2025-08-19T12:13:41Z)
- Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation [8.624395048491275]
We propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). ADSA identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention. We also introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately 50%.
arXiv Detail & Related papers (2025-06-23T01:27:06Z)
- DEAL: Disentangling Transformer Head Activations for LLM Steering [19.770342907146965]
We propose a principled causal-attribution framework for identifying behavior-relevant attention heads in transformers. For each head, we train a vector-quantized autoencoder (VQ-AE) on its attention activations. We assess the behavioral relevance of each head by the separability of VQ-AE encodings for behavior-aligned versus behavior-violating responses.
arXiv Detail & Related papers (2025-06-10T02:16:50Z)
- Enhancing Transformers Through Conditioned Embedded Tokens [28.80560770188464]
We develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. We introduce conditioned tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning, leading to more stable and efficient training.
arXiv Detail & Related papers (2025-05-19T07:21:53Z)
- Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z)
- Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding [121.08841110022607]
Existing agent-centric methods have demonstrated outstanding performance on public benchmarks.
We introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers.
By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods.
arXiv Detail & Related papers (2023-10-19T17:59:01Z)
- Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning [114.36124979578896]
We design a dynamic mechanism using offline reinforcement learning algorithms.
Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set.
arXiv Detail & Related papers (2022-05-05T05:44:26Z)
- Is Disentanglement enough? On Latent Representations for Controllable Music Generation [78.8942067357231]
In the absence of a strong generative decoder, disentanglement does not necessarily imply controllability.
The structure of the latent space with respect to the VAE-decoder plays an important role in boosting the ability of a generative model to manipulate different attributes.
arXiv Detail & Related papers (2021-08-01T18:37:43Z)
- Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [22.91753200323264]
We propose a novel pure transformer-based framework, Feature Fusion Vision Transformer (FFVT).
We aggregate the important tokens from each transformer layer to compensate for the local, low-level and middle-level information.
We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens.
arXiv Detail & Related papers (2021-07-06T01:48:43Z)
- Hierarchical Variational Autoencoder for Visual Counterfactuals [79.86967775454316]
Conditional Variational Autoencoders (VAE) are gathering significant attention as an Explainable Artificial Intelligence (XAI) tool.
In this paper we show how relaxing the effect of the posterior leads to successful counterfactuals.
We introduce VAEX, a Hierarchical VAE designed for this approach that can visually audit a classifier in applications.
arXiv Detail & Related papers (2021-02-01T14:07:11Z)
- Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference [55.35176938713946]
We develop a deep autoencoding topic model (DATM) that uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network.
We propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a downward generative model.
The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
arXiv Detail & Related papers (2020-06-15T22:22:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.