AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection
- URL: http://arxiv.org/abs/2503.23450v1
- Date: Sun, 30 Mar 2025 14:09:13 GMT
- Title: AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection
- Authors: Bohao Xing, Kaishen Yuan, Zitong Yu, Xin Liu, Heikki Kälviäinen
- Abstract summary: Facial Action Units (AUs) detection is a cornerstone of objective facial expression analysis and a critical focus in affective computing. AU detection faces significant challenges, such as the high cost of AU annotation and the limited availability of datasets. We propose a novel vision backbone tailored for AU detection, incorporating bidirectional TTT blocks, named AU-TTT.
- Score: 18.523909610800757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Facial Action Units (AUs) detection is a cornerstone of objective facial expression analysis and a critical focus in affective computing. Despite its importance, AU detection faces significant challenges, such as the high cost of AU annotation and the limited availability of datasets. These constraints often lead to overfitting in existing methods, resulting in substantial performance degradation when applied across diverse datasets. Addressing these issues is essential for improving the reliability and generalizability of AU detection methods. Moreover, many current approaches leverage Transformers for their effectiveness in long-context modeling, but they are hindered by the quadratic complexity of self-attention. Recently, Test-Time Training (TTT) layers have emerged as a promising solution for long-sequence modeling. Additionally, TTT applies self-supervised learning for iterative updates during both training and inference, offering a potential pathway to mitigate the generalization challenges inherent in AU detection tasks. In this paper, we propose a novel vision backbone tailored for AU detection, incorporating bidirectional TTT blocks, named AU-TTT. Our approach introduces TTT Linear to the AU detection task and optimizes image scanning mechanisms for enhanced performance. Additionally, we design an AU-specific Region of Interest (RoI) scanning mechanism to capture fine-grained facial features critical for AU detection. Experimental results demonstrate that our method achieves competitive performance in both within-domain and cross-domain scenarios.
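The abstract contrasts quadratic self-attention with TTT layers, whose hidden state is itself a small model trained on the fly with a self-supervised loss. As background, here is a minimal numpy sketch of a generic TTT-Linear-style scan; the function name, the zero initialization, and the toy reconstruction loss are illustrative assumptions, not the AU-TTT implementation:

```python
import numpy as np

def ttt_linear_scan(tokens, lr=0.1):
    """Illustrative sketch of a TTT-Linear-style scan (not the AU-TTT code).

    The layer's hidden state is a weight matrix W that is trained on the fly:
    for each token, W takes one gradient step on a self-supervised
    reconstruction loss, then produces the output. One pass over the sequence
    costs O(length), unlike the O(length^2) cost of self-attention.
    """
    d = tokens.shape[1]
    W = np.zeros((d, d))                 # fast weights: the hidden state
    outputs = []
    for x in tokens:                     # single scan direction; a bidirectional
                                         # block would also run the reversed scan
        pred = W @ x                     # inner linear model f(W; x) = W x
        grad = np.outer(pred - x, x)     # grad of 0.5 * ||W x - x||^2 w.r.t. W
        W -= lr * grad                   # inner-loop SGD: the "test-time training"
        outputs.append(W @ x)            # read out with the updated weights
    return np.stack(outputs)
```

A bidirectional variant, as the name "bidirectional TTT blocks" suggests, would run this scan forward and reversed over the token sequence and combine the two output streams.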
Related papers
- Decoupled Doubly Contrastive Learning for Cross Domain Facial Action Unit Detection [66.80386429324196]
We propose a decoupled doubly contrastive adaptation (D$2$CA) approach to learn a purified AU representation. D$2$CA is trained to disentangle AU and domain factors by assessing the quality of synthesized faces. It consistently outperforms state-of-the-art cross-domain AU detection approaches.
arXiv Detail & Related papers (2025-03-12T00:42:17Z) - Facial Action Unit Detection by Adaptively Constraining Self-Attention and Causally Deconfounding Sample [53.23474626420103]
Facial action unit (AU) detection remains a challenging task, due to the subtlety, dynamics, and diversity of AUs.
We propose a novel AU detection framework called AC2D by adaptively constraining self-attention weight distribution.
Our method achieves competitive performance compared to state-of-the-art AU detection approaches on challenging benchmarks.
arXiv Detail & Related papers (2024-10-02T05:51:24Z) - Feature Attenuation of Defective Representation Can Resolve Incomplete Masking on Anomaly Detection [1.0358639819750703]
In unsupervised anomaly detection (UAD) research, it is necessary to develop a computationally efficient and scalable solution.
We revisit the reconstruction-by-inpainting approach and rethink how to improve it by analyzing its strengths and weaknesses.
We propose Feature Attenuation of Defective Representation (FADeR), which employs only two layers to attenuate the feature information of anomaly reconstruction.
arXiv Detail & Related papers (2024-07-05T15:44:53Z) - AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors [31.547624650827395]
Existing methods suffer from overfitting issues due to the utilization of a large number of learnable parameters.
Parameter-Efficient Transfer Learning (PETL) provides a promising paradigm to address these challenges.
We propose a novel Mixture-of-Knowledge Expert (MoKE) collaboration mechanism.
arXiv Detail & Related papers (2024-03-07T17:46:50Z) - Occlusion-Aware Detection and Re-ID Calibrated Network for Multi-Object Tracking [38.36872739816151]
Occlusion-Aware Attention (OAA) module in the detector highlights the object features while suppressing the occluded background regions.
OAA can serve as a modulator that enhances the detector for some potentially occluded objects.
We design a Re-ID embedding matching block based on the optimal transport problem.
arXiv Detail & Related papers (2023-08-30T06:56:53Z) - Self-supervised Facial Action Unit Detection with Region and Relation Learning [5.182661263082065]
We propose a novel self-supervised framework for AU detection with the region and relation learning.
An improved Optimal Transport (OT) algorithm is introduced to exploit the correlation characteristics among AUs.
Swin Transformer is exploited to model the long-distance dependencies within each AU region during feature learning.
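Two of the entries above lean on optimal transport for matching (Re-ID embeddings) and for modeling correlations among AUs. As background, a minimal entropic-OT sketch via Sinkhorn iterations; this is the generic machinery, not either paper's "improved" variant, and all names here are illustrative:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Generic entropic optimal transport via Sinkhorn iterations.

    cost: (m, n) pairwise cost matrix between two sets of items
          (e.g. detections vs. Re-ID embeddings, or AU features).
    a, b: marginal distributions over the two sets (each sums to 1).
    Returns a soft transport plan P whose rows sum to a and columns to b.
    """
    K = np.exp(-cost / eps)              # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                # alternately rescale to match
        u = a / (K @ v)                  # the column and row marginals
    return u[:, None] * K * v[None, :]   # P = diag(u) K diag(v)
```

The resulting plan acts as a differentiable soft assignment: low-cost pairs receive high mass, which is what makes OT usable as a matching or correlation module inside a trainable detector.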
arXiv Detail & Related papers (2023-03-10T05:22:45Z) - Weakly Supervised Regional and Temporal Learning for Facial Action Unit Recognition [36.350407471391065]
We propose two auxiliary AU related tasks to bridge the gap between limited annotations and the model performance.
A single image based optical flow estimation task is proposed to leverage the dynamic change of facial muscles.
By incorporating semi-supervised learning, we propose an end-to-end trainable framework named weakly supervised regional and temporal learning.
arXiv Detail & Related papers (2022-04-01T12:02:01Z) - Test-time Adaptation with Slot-Centric Models [63.981055778098444]
Slot-TTA is a semi-supervised scene decomposition model that is adapted per scene at test time through gradient descent on reconstruction or cross-view synthesis objectives.
We show substantial out-of-distribution performance improvements against state-of-the-art supervised feed-forward detectors, and alternative test-time adaptation methods.
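The recipe in this entry, refining parameters per test sample by gradient descent on a self-supervised reconstruction objective, can be sketched generically. The rank-1 "autoencoder" below is a hypothetical stand-in for the slot-centric network, chosen only so the gradient can be written in a few lines:

```python
import numpy as np

def adapt_at_test_time(x, w, steps=30, lr=0.05):
    """Generic sketch of per-sample test-time adaptation (Slot-TTA style).

    Parameters w are updated for this one test input x by gradient descent
    on a label-free reconstruction loss. The model is a toy rank-1
    autoencoder x -> (w . x) * w, standing in for the real architecture.
    """
    losses = []
    for _ in range(steps):
        c = w @ x                        # encode: scalar code c = w . x
        r = c * w - x                    # decode and form the residual
        losses.append(0.5 * r @ r)       # reconstruction loss 0.5 * ||r||^2
        grad = (r @ w) * x + c * r       # analytic d(loss)/dw
        w = w - lr * grad                # adaptation step for this sample
    return w, losses
```

The point of the pattern is that the adaptation signal requires no labels, so it remains available out of distribution, where feed-forward detectors degrade.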
arXiv Detail & Related papers (2022-03-21T17:59:50Z) - Robust and Precise Facial Landmark Detection by Self-Calibrated Pose Attention Network [73.56802915291917]
We propose a semi-supervised framework to achieve more robust and precise facial landmark detection.
A Boundary-Aware Landmark Intensity (BALI) field is proposed to model more effective facial shape constraints.
A Self-Calibrated Pose Attention (SCPA) model is designed to provide a self-learned objective function that enforces intermediate supervision.
arXiv Detail & Related papers (2021-12-23T02:51:08Z) - Meta Auxiliary Learning for Facial Action Unit Detection [84.22521265124806]
We consider learning AU detection and facial expression recognition in a multi-task manner.
The performance of the AU detection task cannot always be enhanced, owing to negative transfer in the multi-task scenario.
We propose a Meta Auxiliary Learning method (MAL) that automatically selects highly related FE samples by learning adaptive weights for the training FE samples in a meta-learning manner.
arXiv Detail & Related papers (2021-05-14T02:28:40Z) - J$\hat{\text{A}}$A-Net: Joint Facial Action Unit Detection and Face Alignment via Adaptive Attention [57.51255553918323]
We propose a novel end-to-end deep learning framework for joint AU detection and face alignment.
Our framework significantly outperforms the state-of-the-art AU detection methods on the challenging BP4D, DISFA, GFT and BP4D+ benchmarks.
arXiv Detail & Related papers (2020-03-18T12:50:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.