A$^{3}$lign-DFER: Pioneering Comprehensive Dynamic Affective Alignment
for Dynamic Facial Expression Recognition with CLIP
- URL: http://arxiv.org/abs/2403.04294v1
- Date: Thu, 7 Mar 2024 07:43:04 GMT
- Title: A$^{3}$lign-DFER: Pioneering Comprehensive Dynamic Affective Alignment
for Dynamic Facial Expression Recognition with CLIP
- Authors: Zeng Tao, Yan Wang, Junxiong Lin, Haoran Wang, Xinji Mai, Jiawen Yu,
Xuan Tong, Ziheng Zhou, Shaoqi Yan, Qing Zhao, Liyuan Han, Wenqiang Zhang
- Abstract summary: A$^{3}$lign-DFER is a new DFER labeling paradigm to comprehensively achieve alignment.
Our A$^{3}$lign-DFER method achieves state-of-the-art results on multiple DFER datasets, including DFEW, FERV39k, and MAFW.
- Score: 30.369339525599496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: CLIP does not achieve the exceptional performance in the dynamic
facial expression recognition (DFER) task that it attains in other CLIP-based
classification tasks. While CLIP's primary objective is to align images and
text in the feature space, DFER poses challenges due to the abstract nature of
text and the dynamic nature of video, which make label representation limited
and perfect alignment difficult. To address this issue,
we have designed A$^{3}$lign-DFER, which introduces a new DFER labeling
paradigm to comprehensively achieve alignment, thus enhancing CLIP's
suitability for the DFER task. Specifically, our A$^{3}$lign-DFER method is
designed with multiple modules that work together to obtain the most suitable
expanded-dimensional embeddings for classification and to achieve alignment in
three key aspects: affective, dynamic, and bidirectional. We replace the input
label text with a learnable Multi-Dimensional Alignment Token (MAT), enabling
alignment of text to facial expression video samples in both affective and
dynamic dimensions. After CLIP feature extraction, we introduce the Joint
Dynamic Alignment Synchronizer (JAS), further facilitating synchronization and
alignment in the temporal dimension. Additionally, we implement a Bidirectional
Alignment Training Paradigm (BAP) to ensure gradual and steady training of
parameters for both modalities. Our insightful and concise A$^{3}$lign-DFER
method achieves state-of-the-art results on multiple DFER datasets, including
DFEW, FERV39k, and MAFW. Extensive ablation experiments and visualization
studies demonstrate the effectiveness of A$^{3}$lign-DFER. The code will be
available in the future.
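To make the pipeline concrete, here is a minimal PyTorch-style sketch of the three modules the abstract names. The MAT token shape, the Transformer-based design of JAS, and the even/odd alternation in BAP are illustrative assumptions (the official code is not yet released), not the authors' implementation.

```python
# Minimal sketch of the A^3lign-DFER recipe, assuming a CLIP-style dual
# encoder supplies per-frame visual features. All internals here are
# illustrative assumptions, not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MAT(nn.Module):
    """Multi-Dimensional Alignment Token: one learnable token sequence per
    expression class, used in place of tokenized label text."""

    def __init__(self, num_classes: int, num_tokens: int, dim: int):
        super().__init__()
        self.tokens = nn.Parameter(0.02 * torch.randn(num_classes, num_tokens, dim))

    def forward(self) -> torch.Tensor:
        # Pooled here for simplicity; the paper's expanded-dimensional
        # embeddings are handled in a more elaborate way.
        return self.tokens.mean(dim=1)  # (C, dim)


class JAS(nn.Module):
    """Joint Dynamic Alignment Synchronizer, sketched as a small temporal
    Transformer over per-frame CLIP features."""

    def __init__(self, dim: int, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, dim) from the CLIP image encoder; returns one
        # temporally aligned clip-level feature per video.
        return self.encoder(frame_feats).mean(dim=1)  # (B, dim)


def classification_loss(video_feats, class_feats, labels, logit_scale=100.0):
    """CLIP-style cosine-similarity logits followed by cross-entropy."""
    v = F.normalize(video_feats, dim=-1)  # (B, dim)
    t = F.normalize(class_feats, dim=-1)  # (C, dim)
    return F.cross_entropy(logit_scale * v @ t.t(), labels)


def bap_step(step, loss, opt_text, opt_visual):
    """Bidirectional Alignment Training Paradigm, sketched as alternating
    updates so the text side (MAT) and visual side (JAS) adapt gradually
    rather than all at once. The even/odd schedule is an assumption."""
    opt_text.zero_grad()
    opt_visual.zero_grad()
    loss.backward()
    (opt_text if step % 2 == 0 else opt_visual).step()
```

A training step would extract per-frame features with a frozen CLIP image encoder, pool them with JAS, score them against the MAT class embeddings via classification_loss, and call bap_step so the two modalities are tuned in alternation.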
Related papers
- Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID [82.12123628480371]
Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning.
Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning.
We propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds an objective for the specific fine-grained patterns emphasized by each modality.
arXiv Detail & Related papers (2025-04-27T13:58:12Z) - econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians [56.85804719947]
We propose econSG for open-vocabulary semantic segmentation with 3DGS.
Our econSG shows state-of-the-art performance on four benchmark datasets compared to existing methods.
arXiv Detail & Related papers (2025-04-08T13:12:31Z) - SIT-FER: Integration of Semantic-, Instance-, Text-level Information for Semi-supervised Facial Expression Recognition [4.670023983240585]
We propose a novel SS-DFER framework that simultaneously incorporates semantic, instance, and text-level information to generate high-quality pseudo-labels.
Our method significantly outperforms current state-of-the-art SS-DFER methods and even exceeds fully supervised baselines.
arXiv Detail & Related papers (2025-03-24T09:08:14Z) - EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting [108.15136508964011]
EgoSplat is a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding.
EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets.
arXiv Detail & Related papers (2025-03-14T12:21:26Z) - Lifting Scheme-Based Implicit Disentanglement of Emotion-Related Facial Dynamics in the Wild [3.3905929183808796]
In-the-wild dynamic facial expression recognition (DFER) faces a significant challenge: emotion-related facial dynamics are entangled with emotion-irrelevant context.
We propose a novel Implicit Facial Dynamics Disentanglement framework (IFDD).
IFDD disentangles emotion-related dynamic information from emotion-irrelevant global context in an implicit manner.
arXiv Detail & Related papers (2024-12-17T18:45:53Z) - CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections [22.32157080294386]
We propose a label-free prompt-tuning method to enhance CLIP-based image classification performance using unlabeled images.
Our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFTer.
arXiv Detail & Related papers (2024-11-28T19:48:54Z) - LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a vision-language task that segments, from a 3D point cloud, all points of the object specified by a query sentence.
We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which requires only the supervision of efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z) - VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding [9.048401253308123]
This paper investigates flexible organization and explicit correlation learning for multiple views.
We devise a nimble Transformer model, named VSFormer, to explicitly capture pairwise and higher-order correlations of all elements in the set.
It reaches state-of-the-art results on various 3D recognition datasets, including ModelNet40, ScanObjectNN and RGBD.
arXiv Detail & Related papers (2024-09-14T01:48:54Z) - SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs).
We develop an in-context learning approach to associate the inherent knowledge from LLMs.
Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z) - SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders [7.618223798662929]
We propose SA-DVAE -- Semantic Alignment via Disentangled Variational Autoencoders.
We implement this idea via a pair of modality-specific variational autoencoders coupled with a total correlation penalty.
Experiments show that SA-DVAE produces improved performance over existing methods.
arXiv Detail & Related papers (2024-07-18T12:35:46Z) - FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER).
Our FineCLIPER achieves tunable SOTA performance on the DFEW, FERV39k, and MAFW datasets with few parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z) - LEAF: Unveiling Two Sides of the Same Coin in Semi-supervised Facial Expression Recognition [56.22672276092373]
Semi-supervised learning has emerged as a promising approach to tackle the challenge of label scarcity in facial expression recognition.
We propose a unified framework termed hierarchicaL dEcoupling And Fusing (LEAF) to coordinate expression-relevant representations and pseudo-labels.
We show that LEAF outperforms state-of-the-art semi-supervised FER methods, effectively leveraging both labeled and unlabeled data.
arXiv Detail & Related papers (2024-04-23T13:43:33Z) - N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields [112.02885337510716]
Nested Neural Feature Fields (N2F2) is a novel approach that employs hierarchical supervision to learn a single feature field.
We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space.
Our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization.
arXiv Detail & Related papers (2024-03-16T18:50:44Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S^2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z) - CLIP Brings Better Features to Visual Aesthetics Learners [12.0962117940694]
Image aesthetics assessment (IAA) is an ideal application scenario for such methods due to its subjective and expensive labeling procedure.
In this work, a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation paradigm is proposed, namely CSKD.
arXiv Detail & Related papers (2023-07-28T16:00:21Z)