Video-based Cross-modal Auxiliary Network for Multimodal Sentiment Analysis
- URL: http://arxiv.org/abs/2208.13954v1
- Date: Tue, 30 Aug 2022 02:08:06 GMT
- Title: Video-based Cross-modal Auxiliary Network for Multimodal Sentiment Analysis
- Authors: Rongfei Chen, Wenju Zhou, Yang Li, Huiyu Zhou
- Abstract summary: A Video-based Cross-modal Auxiliary Network (VCAN) is proposed, comprising an audio features map module and a cross-modal selection module.
VCAN significantly outperforms state-of-the-art methods in the classification accuracy of multimodal sentiment analysis.
- Score: 16.930624128228658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal sentiment analysis has a wide range of applications
because the modalities complement one another in multimodal interactions.
Previous works focus on learning efficient joint representations, but they
rarely address insufficient unimodal feature extraction or the data redundancy
of multimodal fusion. In this paper, a Video-based Cross-modal Auxiliary
Network (VCAN) is proposed, which comprises an audio features map module and a
cross-modal selection module. The first module substantially increases feature
diversity during audio feature extraction, aiming to improve classification
accuracy by providing more comprehensive acoustic representations. To handle
redundant visual features, the second module efficiently filters out redundant
visual frames when integrating audiovisual data. Moreover, a classifier group
consisting of several image classification networks is introduced to predict
sentiment polarities and emotion categories. Extensive experiments on the
RAVDESS, CMU-MOSI, and CMU-MOSEI benchmarks indicate that VCAN significantly
outperforms state-of-the-art methods in the classification accuracy of
multimodal sentiment analysis.
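As a rough illustration of the two modules described in the abstract, the PyTorch sketch below pairs a multi-branch audio feature map with an audio-conditioned top-k selector over visual frames. All class names, dimensions, and the top-k selection rule are assumptions made for illustration only; they are not the authors' implementation.

```python
# Illustrative sketch only: assumed shapes, layer choices, and top-k rule.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioFeatureMap(nn.Module):
    """Expand an acoustic sequence into a multi-channel feature map (feature diversity)."""

    def __init__(self, in_dim=40, out_channels=32):
        super().__init__()
        # Parallel convolutions with different kernel sizes give diverse acoustic views.
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_dim, out_channels, k, padding=k // 2) for k in (3, 5, 7)]
        )

    def forward(self, audio):                      # audio: (B, T, in_dim)
        x = audio.transpose(1, 2)                  # (B, in_dim, T)
        return torch.cat([F.relu(b(x)) for b in self.branches], dim=1)  # (B, 96, T)


class CrossModalSelection(nn.Module):
    """Score visual frames against a global audio summary and keep the top-k frames."""

    def __init__(self, audio_dim=96, visual_dim=512, k=8):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(audio_dim, visual_dim)       # map audio summary into visual space

    def forward(self, audio_map, frames):                  # frames: (B, F, visual_dim)
        query = self.proj(audio_map.mean(dim=2))            # (B, visual_dim) global audio query
        scores = torch.einsum("bd,bfd->bf", query, frames)  # relevance of each visual frame
        idx = scores.topk(self.k, dim=1).indices             # keep the k most relevant frames
        return torch.gather(frames, 1, idx.unsqueeze(-1).expand(-1, -1, frames.size(-1)))


# Toy usage: 40-dim acoustic features over 100 steps, 16 candidate frame embeddings.
audio_map = AudioFeatureMap()(torch.randn(2, 100, 40))
kept = CrossModalSelection()(audio_map, torch.randn(2, 16, 512))
print(kept.shape)                                            # torch.Size([2, 8, 512])
```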
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- Few-Shot Medical Image Segmentation with Large Kernel Attention [5.630842216128902]
We propose a few-shot medical segmentation model that acquires comprehensive feature representation capabilities.
Our model comprises four key modules: a dual-path feature extractor, an attention module, an adaptive prototype prediction module, and a multi-scale prediction fusion module.
The results demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-07-27T02:28:30Z) - Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention [1.1155836879100416]
We propose a Modality-agnostic Domain Generalizable Network (MADGNet) for medical image segmentation.
The MFMSA block refines spatial feature extraction, particularly the capture of boundary features.
E-SDM mitigates information loss in multi-task learning with deep supervision.
arXiv Detail & Related papers (2024-05-10T07:34:36Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
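A minimal sketch of the query-based aggregation idea behind IMQ, under the assumption that a small set of learned queries cross-attends over one modality's tokens to pool global contextual cues; names and sizes are illustrative, not the paper's implementation.

```python
# Hypothetical sketch: learned queries pool global cues from one modality's tokens.
import torch
import torch.nn as nn


class LearnedQueryPooling(nn.Module):
    def __init__(self, dim=256, num_queries=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim), one modality
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)     # queries attend over all tokens
        return pooled                                 # (B, num_queries, dim)


pooled = LearnedQueryPooling()(torch.randn(2, 196, 256))
print(pooled.shape)  # torch.Size([2, 4, 256])
```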
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
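The following sketch illustrates the general frequency-decomposition idea: a one-level Haar wavelet split into four sub-bands, with self-attention applied within each sub-band (intra-band) and across sub-band summaries (inter-band). It is a hedged reconstruction with assumed shapes and layers, not the UFAFormer code.

```python
# Illustrative reconstruction only: assumed channel counts, token sizes, and pooling.
import torch
import torch.nn as nn


def haar_dwt2(x):                                    # x: (B, C, H, W), H and W even
    lo_r = (x[..., 0::2, :] + x[..., 1::2, :]) / 2   # row-wise low pass
    hi_r = (x[..., 0::2, :] - x[..., 1::2, :]) / 2   # row-wise high pass
    bands = []
    for rows in (lo_r, hi_r):                        # then split each along the width
        bands += [(rows[..., 0::2] + rows[..., 1::2]) / 2,
                  (rows[..., 0::2] - rows[..., 1::2]) / 2]
    return bands                                     # four (B, C, H/2, W/2) sub-bands


class BandAttention(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(3, dim)                # hypothetical per-pixel embedding (3 channels)
        self.intra = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, bands):                        # list of four (B, 3, h, w) sub-bands
        summaries = []
        for band in bands:
            t = self.proj(band.flatten(2).transpose(1, 2))  # (B, h*w, dim) band tokens
            t, _ = self.intra(t, t, t)                       # intra-band self-attention
            summaries.append(t.mean(dim=1))                  # (B, dim) band summary
        x = torch.stack(summaries, dim=1)                    # (B, 4, dim)
        fused, _ = self.inter(x, x, x)                       # inter-band self-attention
        return fused


out = BandAttention()(haar_dwt2(torch.randn(2, 3, 32, 32)))
print(out.shape)  # torch.Size([2, 4, 64])
```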
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generality of the proposed mhsf model under both pre-training+fine-tuning and fresh-training strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
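A minimal sketch of the bottleneck-style fusion described above, assuming a few shared bottleneck tokens are the only path through which the audio and visual streams exchange information at a fusion layer; layer choices and sizes are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical single fusion layer: modalities only interact via shared bottleneck tokens.
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, num_bottlenecks=4, num_heads=8):
        super().__init__()
        self.bottlenecks = nn.Parameter(torch.randn(num_bottlenecks, dim) * 0.02)
        self.audio_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens):   # (B, Na, dim), (B, Nv, dim)
        b = self.bottlenecks.unsqueeze(0).expand(audio_tokens.size(0), -1, -1)
        n = b.size(1)
        # Each stream attends jointly over its own tokens plus the shared bottlenecks.
        a_out = self.audio_layer(torch.cat([audio_tokens, b], dim=1))
        v_out = self.video_layer(torch.cat([video_tokens, b], dim=1))
        # Average the two updated bottleneck copies so both streams stay in sync.
        new_b = (a_out[:, -n:] + v_out[:, -n:]) / 2
        return a_out[:, :-n], v_out[:, :-n], new_b


a, v, btl = BottleneckFusionLayer()(torch.randn(2, 50, 256), torch.randn(2, 196, 256))
print(a.shape, v.shape, btl.shape)  # (2, 50, 256) (2, 196, 256) (2, 4, 256)
```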
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Accelerated Multi-Modal MR Imaging with Transformers [92.18406564785329]
We propose a multi-modal transformer (MTrans) for accelerated MR imaging.
By restructuring the transformer architecture, our MTrans gains a powerful ability to capture deep multi-modal information.
Our framework provides two appealing benefits: (i) MTrans is the first attempt at using improved transformers for multi-modal MR imaging, affording more global information compared with CNN-based methods.
arXiv Detail & Related papers (2021-06-27T15:01:30Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
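A minimal sketch of a parallel co-attention update in the spirit of the summary above: both streams are refreshed from the same pre-update tensors, so neither modality sees the other's updated state within the step. Dimensions and layer choices are assumptions for illustration, not the EFN implementation.

```python
# Hypothetical co-attention step: visual and language features update each other in parallel.
import torch
import torch.nn as nn


class CoAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual, language):             # (B, Nv, dim), (B, Nl, dim)
        # Both updates read the original tensors, so the two streams refresh in parallel.
        new_visual, _ = self.v2l(visual, language, language)
        new_language, _ = self.l2v(language, visual, visual)
        return visual + new_visual, language + new_language


v, l = CoAttention()(torch.randn(2, 196, 256), torch.randn(2, 20, 256))
print(v.shape, l.shape)  # (2, 196, 256) (2, 20, 256)
```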
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- A Discriminative Vectorial Framework for Multi-modal Feature Representation [19.158947368297557]
A discriminative framework is proposed for multimodal feature representation in knowledge discovery.
It employs multi-modal hashing (MH) and discriminative correlation maximization (DCM) analysis.
The framework is superior to state-of-the-art statistical machine learning (SML) and deep neural network (DNN) algorithms.
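A minimal sketch of the multi-modal hashing idea mentioned above, assuming each modality is projected into a shared space and binarised into a compact code; dimensions and the fusion rule are illustrative assumptions, not the paper's method.

```python
# Hypothetical multi-modal hashing: project both modalities, fuse, and binarise.
import torch
import torch.nn as nn


class MultiModalHash(nn.Module):
    def __init__(self, dim_a=512, dim_b=300, bits=64):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, bits)
        self.proj_b = nn.Linear(dim_b, bits)

    def forward(self, feat_a, feat_b):
        # Shared-space projection; tanh is the usual training-time relaxation.
        joint = torch.tanh(self.proj_a(feat_a) + self.proj_b(feat_b))
        return torch.sign(joint)                   # (B, bits) binary code in {-1, +1}


codes = MultiModalHash()(torch.randn(4, 512), torch.randn(4, 300))
print(codes.shape)  # torch.Size([4, 64])
```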
arXiv Detail & Related papers (2021-03-09T18:18:06Z)