Learning an Ensemble Token from Task-driven Priors in Facial Analysis
- URL: http://arxiv.org/abs/2507.01290v1
- Date: Wed, 02 Jul 2025 02:07:31 GMT
- Title: Learning an Ensemble Token from Task-driven Priors in Facial Analysis
- Authors: Sunyong Seo, Semin Kim, Jongha Lee,
- Abstract summary: We introduce ET-Fuser, a novel methodology for learning ensemble token.<n>We propose a robust prior unification learning method that generates a ensemble token within a self-attention mechanism.<n>Our results show improvements across a variety of facial analysis, with statistically significant enhancements observed in the feature representations.
- Score: 1.4228349888743608
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial analysis exhibits task-specific feature variations. While Convolutional Neural Networks (CNNs) have enabled the fine-grained representation of spatial information, Vision Transformers (ViTs) have facilitated the representation of semantic information at the patch level. Although the generalization of conventional methodologies has advanced visual interpretability, there remains paucity of research that preserves the unified feature representation on single task learning during the training process. In this work, we introduce ET-Fuser, a novel methodology for learning ensemble token by leveraging attention mechanisms based on task priors derived from pre-trained models for facial analysis. Specifically, we propose a robust prior unification learning method that generates a ensemble token within a self-attention mechanism, which shares the mutual information along the pre-trained encoders. This ensemble token approach offers high efficiency with negligible computational cost. Our results show improvements across a variety of facial analysis, with statistically significant enhancements observed in the feature representations.
Related papers
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.<n>Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Transition Network Analysis: A Novel Framework for Modeling, Visualizing, and Identifying the Temporal Patterns of Learners and Learning Processes [0.43981305860983705]
This paper presents a novel learning analytics method: Transition Network Analysis (TNA)<n>TNA integrates Process Mining and probabilistic graph representation to model, visualize, and identify transition patterns in the learning process data.<n>Future directions include -- inter alia -- expanding estimation methods, reliability assessment, and building longitudinal TNA.
arXiv Detail & Related papers (2024-11-23T07:54:15Z) - Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
arXiv Detail & Related papers (2024-03-26T06:04:50Z) - Emotic Masked Autoencoder with Attention Fusion for Facial Expression Recognition [1.4374467687356276]
This paper presents an innovative approach integrating the MAE-Face self-supervised learning (SSL) method and multi-view Fusion Attention mechanism for expression classification.
We suggest easy-to-implement and no-training frameworks aimed at highlighting key facial features to determine if such features can serve as guides for the model.
The efficacy of this method is validated by improvements in model performance on the Aff-wild2 dataset.
arXiv Detail & Related papers (2024-03-19T16:21:47Z) - What Makes Pre-Trained Visual Representations Successful for Robust
Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - A Quantitative Approach to Predicting Representational Learning and
Performance in Neural Networks [5.544128024203989]
Key property of neural networks is how they learn to represent and manipulate input information in order to solve a task.
We introduce a new pseudo-kernel based tool for analyzing and predicting learned representations.
arXiv Detail & Related papers (2023-07-14T18:39:04Z) - Discovering Object-Centric Generalized Value Functions From Pixels [17.10287710842919]
We introduce a method that tries to discover meaningful features from objects, translating them to temporally coherent "question" functions.
We also investigate the discovered general value functions and show that the learned representations are not only interpretable but also, centered around objects that are invariant to changes across tasks facilitating fast adaptation.
arXiv Detail & Related papers (2023-04-27T00:34:24Z) - Multivariate Business Process Representation Learning utilizing Gramian
Angular Fields and Convolutional Neural Networks [0.0]
Learning meaningful representations of data is an important aspect of machine learning.
For predictive process analytics, it is essential to have all explanatory characteristics of a process instance available.
We propose a novel approach for representation learning of business process instances.
arXiv Detail & Related papers (2021-06-15T10:21:14Z) - Variational Structured Attention Networks for Deep Visual Representation
Learning [49.80498066480928]
We propose a unified deep framework to jointly learn both spatial attention maps and channel attention in a principled manner.
Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework.
We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic and the CNN front-end parameters.
arXiv Detail & Related papers (2021-03-05T07:37:24Z) - Heterogeneous Contrastive Learning: Encoding Spatial Information for
Compact Visual Representations [183.03278932562438]
This paper presents an effective approach that adds spatial information to the encoding stage to alleviate the learning inconsistency between the contrastive objective and strong data augmentation operations.
We show that our approach achieves higher efficiency in visual representations and thus delivers a key message to inspire the future research of self-supervised visual representation learning.
arXiv Detail & Related papers (2020-11-19T16:26:25Z) - A Trainable Optimal Transport Embedding for Feature Aggregation and its
Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.