Related papers: MTCAE-DFER: Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition

URL: http://arxiv.org/abs/2412.18988v1
Date: Wed, 25 Dec 2024 21:52:31 GMT
Title: MTCAE-DFER: Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition
Authors: Peihao Xiang, Kaida Wu, Chaohao Lin, Ou Bai
Abstract summary: This paper expands the cascaded network branch of the autoencoder-based multi-task learning (MTL) framework for dynamic facial expression recognition. We utilize an autoencoder-based multi-task cascaded learning approach to explore the impact of dynamic face detection and dynamic face landmark on dynamic facial expression recognition.
Score: 0.19285000127136376
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper expands the cascaded network branch of the autoencoder-based multi-task learning (MTL) framework for dynamic facial expression recognition, namely Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition (MTCAE-DFER). MTCAE-DFER builds a plug-and-play cascaded decoder module, which is based on the Vision Transformer (ViT) architecture and employs the decoder concept of the Transformer to reconstruct the multi-head attention module. The decoder output from the previous task serves as the query (Q), representing local dynamic features, while the Video Masked Autoencoder (VideoMAE) shared encoder output acts as both the key (K) and value (V), representing global dynamic features. This setup facilitates interaction between global and local dynamic features across related tasks. Additionally, this proposal aims to alleviate overfitting in complex large models. We utilize an autoencoder-based multi-task cascaded learning approach to explore the impact of dynamic face detection and dynamic face landmark on dynamic facial expression recognition, which enhances the model's generalization ability. Extensive ablation experiments and comparisons with state-of-the-art (SOTA) methods on various public datasets for dynamic facial expression recognition demonstrate the robustness of the MTCAE-DFER model and the effectiveness of global-local dynamic feature interaction among related tasks.
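
The query/key/value arrangement described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it uses a single attention head, omits the learned projection matrices, residual connections, and normalization that a real ViT-style decoder would have, and the names `cross_attention` and `cascade`, the toy feature values, and the choice of three stages (one per task named in the abstract) are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, key, value):
    """Single-head scaled dot-product attention without learned projections.

    query: list of n_q vectors (local features from the previous task decoder)
    key, value: lists of n_k vectors (global features from the shared encoder)
    """
    scale = 1.0 / math.sqrt(len(key[0]))
    out_dim = len(value[0])
    output = []
    for q in query:
        # Similarity of this query token to every encoder token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in key]
        weights = softmax(scores)
        # Weighted sum of the encoder value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, value)) for j in range(out_dim)]
        )
    return output


def cascade(encoder_out, first_query, num_stages):
    """Chain decoders: each stage's output becomes the next stage's query,
    while the shared encoder output is reused as key and value every time."""
    outputs = []
    query = first_query
    for _ in range(num_stages):
        query = cross_attention(query, encoder_out, encoder_out)
        outputs.append(query)
    return outputs


# Toy data (hypothetical): 3 encoder tokens and 2 query tokens, dimension 2.
encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
first_query = [[0.5, 0.5], [1.0, 0.0]]
stages = cascade(encoder_out, first_query, num_stages=3)
```

Each entry of `stages` keeps the shape of the query (here 2 tokens of dimension 2), so a task-specific head could in principle be attached to any stage. In the paper's setting the stages would correspond to related tasks such as face detection, face landmark, and expression recognition, though the exact ordering and head design are not specified in the abstract.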

Related papers

MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition [2.7745600113170994]
Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. We propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF). The proposed method leverages a Transformer-based architecture to dynamically model inter-view relationships and capture temporal dependencies across multiple views.
arXiv Detail & Related papers (2025-04-03T05:04:05Z)
MVCNet: Multi-View Contrastive Network for Motor Imagery Classification [20.78236894605647]
Motor imagery (MI) decoding has received significant attention due to its intuitive mechanism. Most existing models rely on single-stream architectures and overlook the multi-view nature of EEG signals, leading to limited performance and generalization. We propose a multi-view contrastive network (MVCNet), a dual-branch architecture that integrates CNN and Transformer models in parallel to capture both local spatial-temporal features and global temporal dependencies.
arXiv Detail & Related papers (2025-02-18T10:30:53Z)
EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation [17.0226030258296]
Associating driver attention with the driving scene across two fields of view is a hard cross-domain perception problem. Previous methods typically focus on a single view or map attention to the scene via estimated gaze. We propose a novel method for end-to-end scene-associated driver attention estimation, called EraW-Net.
arXiv Detail & Related papers (2024-08-16T07:12:47Z)
Dynamic Appearance: A Video Representation for Action Recognition with Joint Training [11.746833714322154]
We introduce a new concept, Dynamic Appearance (DA), summarizing the appearance information relating to movement in a video. We consider distilling the dynamic appearance from raw video data as a means of efficient video understanding. We provide extensive experimental results on four action recognition benchmarks.
arXiv Detail & Related papers (2022-11-23T07:16:16Z)
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding [27.568879624013576]
The multimodal transformer exhibits high capacity and flexibility to align image and text for visual grounding. The existing encoder-only grounding framework suffers from heavy computation due to the self-attention operation with quadratic time complexity. We present Dynamic Multimodal DETR (Dynamic MDETR), by decoupling the whole grounding process into encoding and decoding phases.
arXiv Detail & Related papers (2022-09-28T09:43:02Z)
Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically. A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features. In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
Video Coding for Machine: Compact Visual Representation Compression for Intelligent Collaborative Analytics [101.35754364753409]
Video Coding for Machines (VCM) is committed to bridging, to an extent, the separate research tracks of video/image compression and feature compression. This paper summarizes VCM methodology and philosophy based on existing academic and industrial efforts.
arXiv Detail & Related papers (2021-10-18T12:42:13Z)
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task. We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame. Our results demonstrate state-of-the-art performance on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation. An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder. In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal [99.49099501559652]
Video Coding for Machine (VCM) aims to bridge the gap between visual feature compression and classical video coding. We employ a conditional deep generation network to reconstruct video frames with the guidance of a learned motion pattern. By learning to extract sparse motion patterns via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames.
arXiv Detail & Related papers (2020-01-09T14:18:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.