Efficient Multimodal Transformer with Dual-Level Feature Restoration for
Robust Multimodal Sentiment Analysis
- URL: http://arxiv.org/abs/2208.07589v2
- Date: Mon, 22 May 2023 02:27:07 GMT
- Title: Efficient Multimodal Transformer with Dual-Level Feature Restoration for
Robust Multimodal Sentiment Analysis
- Authors: Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao
- Abstract summary: Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently.
Despite significant progress, there are still two major challenges on the way towards robust MSA.
We propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR)
- Score: 47.29528724322795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the proliferation of user-generated online videos, Multimodal Sentiment
Analysis (MSA) has attracted increasing attention recently. Despite significant
progress, there are still two major challenges on the way towards robust MSA:
1) inefficiency when modeling cross-modal interactions in unaligned multimodal
data; and 2) vulnerability to random modality feature missing which typically
occurs in realistic settings. In this paper, we propose a generic and unified
framework to address them, named Efficient Multimodal Transformer with
Dual-Level Feature Restoration (EMT-DLFR). Concretely, EMT employs
utterance-level representations from each modality as the global multimodal
context to interact with local unimodal features and mutually promote each
other. It not only avoids the quadratic scaling cost of previous local-local
cross-modal interaction methods but also leads to better performance. To
improve model robustness in the incomplete modality setting, on the one hand,
DLFR performs low-level feature reconstruction to implicitly encourage the
model to learn semantic information from incomplete data. On the other hand, it
innovatively regards complete and incomplete data as two different views of one
sample and utilizes siamese representation learning to explicitly attract their
high-level representations. Comprehensive experiments on three popular datasets
demonstrate that our method achieves superior performance in both complete and
incomplete modality settings.
Related papers
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Cross-Modal Prototype based Multimodal Federated Learning under Severely
Missing Modality [31.727012729846333]
Multimodal Federated Cross Prototype Learning (MFCPL) is a novel approach for MFL under severely missing modalities.
MFCPL provides diverse modality knowledge in modality-shared level with the cross-modal regularization and modality-specific level with cross-modal contrastive mechanism.
Our approach introduces the cross-modal alignment to provide regularization for modality-specific features, thereby enhancing overall performance.
arXiv Detail & Related papers (2024-01-25T02:25:23Z) - Modality-Collaborative Transformer with Hybrid Feature Reconstruction
for Robust Emotion Recognition [35.15390769958969]
We propose a unified framework, Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR)
MCT-HFR consists of a novel attention-based encoder which concurrently extracts and dynamically balances the intra- and inter-modality relations.
During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise complete and incomplete representations.
arXiv Detail & Related papers (2023-12-26T01:59:23Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Visual Prompt Flexible-Modal Face Anti-Spoofing [23.58674017653937]
multimodal face data collected from the real world is often imperfect due to missing modalities from various imaging sensors.
We propose flexible-modal FAS, which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to downstream flexible-modal FAS task.
experiments conducted on two multimodal FAS benchmark datasets demonstrate the effectiveness of our VP-FAS framework.
arXiv Detail & Related papers (2023-07-26T05:06:41Z) - FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenarios with the help of available multi-modal data.
Experiments demonstrate that the single model trained based on FM-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z) - Multimodal Federated Learning via Contrastive Representation Ensemble [17.08211358391482]
Federated learning (FL) serves as a privacy-conscious alternative to centralized machine learning.
Existing FL methods all rely on model aggregation on single modality level.
We propose Contrastive Representation Ensemble and Aggregation for Multimodal FL (CreamFL)
arXiv Detail & Related papers (2023-02-17T14:17:44Z) - Exploiting modality-invariant feature for robust multimodal emotion
recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN)
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.