FER-former: Multi-modal Transformer for Facial Expression Recognition
- URL: http://arxiv.org/abs/2303.12997v1
- Date: Thu, 23 Mar 2023 02:29:53 GMT
- Title: FER-former: Multi-modal Transformer for Facial Expression Recognition
- Authors: Yande Li, Mingjie Wang, Minglun Gong, Yonggang Lu, Li Liu
- Abstract summary: A novel multifarious supervision-steering Transformer for Facial Expression Recognition is proposed in this paper.
Our approach features multi-granularity embedding integration, a hybrid self-attention scheme, and heterogeneous domain-steering supervision.
Experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over existing state-of-the-art methods.
- Score: 14.219492977523682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ever-increasing demand for intuitive interaction in Virtual Reality
has triggered a boom in the realm of Facial Expression Recognition (FER). To
address the limitations of existing approaches (e.g., narrow receptive fields
and homogeneous supervisory signals) and further strengthen the capacity of FER
tools, a novel multifarious supervision-steering Transformer for FER in the
wild, referred to as FER-former, is proposed in this paper. Our approach features
multi-granularity embedding integration, a hybrid self-attention scheme, and
heterogeneous domain-steering supervision. Specifically, to exploit the
complementary merits of the features produced by prevailing CNNs and
Transformers, a hybrid stem is designed to cascade the two learning paradigms.
Within this stem, a FER-specific transformer mechanism is devised to
characterize conventional hard one-hot label-focused tokens and CLIP-based
text-oriented tokens in parallel for final classification. To ease the issue of
annotation ambiguity, a heterogeneous domain-steering supervision module is
proposed to endow image features with text-space semantic correlations by
supervising the similarity between image and text features. Through the
collaboration of these multifarious token heads, diverse global receptive fields
with multi-modal semantic cues are captured, delivering strong learning
capability. Extensive experiments on popular benchmarks demonstrate the
superiority of the proposed FER-former over existing state-of-the-art methods.
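The abstract describes a hybrid CNN-plus-Transformer stem with two parallel token heads: a conventional one-hot classification token and a CLIP-text-oriented token whose similarity to class text embeddings is supervised. Below is a minimal PyTorch sketch of that dual-token idea only, not the authors' implementation; the ResNet-18 stem, encoder depth, loss weighting, temperature, and all module names are assumptions made for illustration.

```python
# Hypothetical sketch of the dual-token design: CNN features become patch tokens,
# a Transformer encoder attends over them plus two extra tokens, and the
# text-oriented token is aligned with CLIP text embeddings of the class names.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18  # stem choice is an assumption

class DualTokenFER(nn.Module):
    def __init__(self, num_classes=7, dim=512, depth=4, heads=8):
        super().__init__()
        cnn = resnet18(weights=None)  # torchvision >= 0.13 API
        self.stem = nn.Sequential(*list(cnn.children())[:-2])   # (B, 512, H', W') feature map
        self.proj = nn.Linear(512, dim)                          # CNN features -> patch tokens
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # hard one-hot label-focused token
        self.text_token = nn.Parameter(torch.zeros(1, 1, dim))   # CLIP-text-oriented token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, images):
        feat = self.stem(images)                                 # (B, 512, H', W')
        tokens = self.proj(feat.flatten(2).transpose(1, 2))      # (B, H'*W', dim)
        b = tokens.size(0)
        extra = torch.cat([self.cls_token, self.text_token], dim=1).expand(b, -1, -1)
        out = self.encoder(torch.cat([extra, tokens], dim=1))    # joint self-attention
        logits = self.cls_head(out[:, 0])                        # one-hot label branch
        img_text_feat = F.normalize(out[:, 1], dim=-1)           # text-space branch
        return logits, img_text_feat

def dual_supervision_loss(logits, img_text_feat, class_text_feats, labels, alpha=0.5):
    """Cross-entropy on the label token plus a similarity term that pulls the
    text-oriented image token toward the CLIP text embedding of its class."""
    ce = F.cross_entropy(logits, labels)
    sims = img_text_feat @ F.normalize(class_text_feats, dim=-1).t()  # (B, num_classes)
    text_ce = F.cross_entropy(sims / 0.07, labels)                    # contrastive-style alignment
    return ce + alpha * text_ce
```

In this sketch, class_text_feats would be precomputed, e.g., by running a frozen CLIP text encoder over prompts such as "a photo of a happy face" for each expression class; the 0.07 temperature mirrors common CLIP practice and is likewise an assumption.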
Related papers
- Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation [67.35274834837064]
We develop a universal vision-language framework (UniFSS) to integrate prompts from text, mask, box, and image.
UniFSS significantly outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T08:41:01Z)
- HCVP: Leveraging Hierarchical Contrastive Visual Prompt for Domain Generalization [69.33162366130887]
Domain Generalization (DG) endeavors to create machine learning models that excel in unseen scenarios by learning invariant features.
We introduce a novel method designed to supplement the model with domain-level and task-specific characteristics.
This approach aims to guide the model in more effectively separating invariant features from specific characteristics, thereby boosting generalization.
arXiv Detail & Related papers (2024-01-18T04:23:21Z)
- FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer [29.95553680263075]
We propose Feature Matching with Reconciliatory Transformer (FMRT), a detector-free method that reconciles different features with multiple receptive fields adaptively.
FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
arXiv Detail & Related papers (2023-10-20T15:54:18Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
- Flat Multi-modal Interaction Transformer for Named Entity Recognition [1.7605709999848573]
Multi-modal named entity recognition (MNER) aims at identifying entity spans and recognizing their categories in social media posts with the aid of images.
We propose a Flat Multi-modal Interaction Transformer (FMIT) for MNER.
We transform the fine-grained semantic representation of the vision and text into a unified lattice structure and design a novel relative position encoding to match different modalities in Transformer.
arXiv Detail & Related papers (2022-08-23T15:25:44Z)
- Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition [63.07844685982738]
This paper presents a new model, the Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations significantly outperform the last hidden states of the LSTM.
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
arXiv Detail & Related papers (2022-01-17T09:46:59Z)
- Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z)