A Self-Adjusting Fusion Representation Learning Model for Unaligned
Text-Audio Sequences
- URL: http://arxiv.org/abs/2212.11772v1
- Date: Sat, 12 Nov 2022 13:05:28 GMT
- Title: A Self-Adjusting Fusion Representation Learning Model for Unaligned
Text-Audio Sequences
- Authors: Kaicheng Yang, Ruxuan Zhang, Hua Xu, Kai Gao
- Abstract summary: How to integrate relevant information of each modality to learn fusion representations has been one of the central challenges in multimodal learning.
In this paper, a Self-Adjusting Fusion Representation Learning Model is proposed to learn robust crossmodal fusion representations directly from the unaligned text and audio sequences.
Experimental results show that our model significantly improves performance on all metrics for unaligned text-audio sequences.
- Score: 16.38826799727453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inter-modal interaction plays an indispensable role in multimodal sentiment
analysis. Because sequences from different modalities are usually unaligned, integrating the relevant information of each modality to learn fusion representations has been one of the central challenges in multimodal learning.
In this paper, a Self-Adjusting Fusion Representation Learning Model (SA-FRLM)
is proposed to learn robust crossmodal fusion representations directly from the
unaligned text and audio sequences. Unlike previous works, our model not only makes full use of the interaction between different modalities but also preserves the unimodal characteristics as much as possible. Specifically, we first employ a crossmodal alignment module to project the features of different modalities to the same dimension. The crossmodal collaboration attention is then
adopted to model the inter-modal interaction between text and audio sequences
and initialize the fusion representations. After that, as the core unit of the
SA-FRLM, the crossmodal adjustment transformer is proposed to protect the original unimodal characteristics by dynamically adapting the fusion representations with the single-modal streams. We evaluate our approach on the public multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results show that our model significantly improves performance on all metrics for unaligned text-audio sequences.
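The abstract outlines a three-stage pipeline: a crossmodal alignment module that projects both modalities to a common dimension, a crossmodal collaboration attention that initializes the fusion representations, and a crossmodal adjustment transformer that keeps adapting them with the single-modal streams. The PyTorch sketch below is only an illustration of that pipeline; the module names, the Conv1d projections, the exact attention wiring, and all feature dimensions (e.g. GloVe-sized text and COVAREP-sized audio features) are assumptions made for this sketch, not the authors' released implementation.

```python
# Minimal, illustrative sketch of the pipeline described in the abstract.
# All module names, the Conv1d projections, the attention wiring, and the
# feature dimensions are assumptions made for illustration; this is NOT the
# authors' released implementation of SA-FRLM.
import torch
import torch.nn as nn


class CrossmodalAlignment(nn.Module):
    """Project text and audio features to a shared model dimension."""

    def __init__(self, text_dim, audio_dim, model_dim):
        super().__init__()
        self.text_proj = nn.Conv1d(text_dim, model_dim, kernel_size=1)
        self.audio_proj = nn.Conv1d(audio_dim, model_dim, kernel_size=1)

    def forward(self, text, audio):
        # text: (B, T_text, text_dim), audio: (B, T_audio, audio_dim); lengths may differ.
        text = self.text_proj(text.transpose(1, 2)).transpose(1, 2)
        audio = self.audio_proj(audio.transpose(1, 2)).transpose(1, 2)
        return text, audio


class CollaborationAttention(nn.Module):
    """Bidirectional cross-attention used here to initialize the fusion sequence."""

    def __init__(self, model_dim, num_heads=4):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)

    def forward(self, text, audio):
        t2a, _ = self.text_to_audio(text, audio, audio)  # text queries attend to audio
        a2t, _ = self.audio_to_text(audio, text, text)   # audio queries attend to text
        # Assumed initialization: concatenate both attended streams along time.
        return torch.cat([t2a, a2t], dim=1)


class AdjustmentBlock(nn.Module):
    """Stand-in for the crossmodal adjustment transformer: the fusion tokens query
    one unimodal stream so the original unimodal characteristics keep influencing
    the fused representation."""

    def __init__(self, model_dim, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(model_dim)
        self.ffn = nn.Sequential(
            nn.Linear(model_dim, 4 * model_dim), nn.ReLU(), nn.Linear(4 * model_dim, model_dim)
        )
        self.norm2 = nn.LayerNorm(model_dim)

    def forward(self, fusion, unimodal):
        adjusted, _ = self.cross_attn(fusion, unimodal, unimodal)
        fusion = self.norm1(fusion + adjusted)
        return self.norm2(fusion + self.ffn(fusion))


class SAFRLMSketch(nn.Module):
    def __init__(self, text_dim=300, audio_dim=74, model_dim=40, num_outputs=1):
        super().__init__()
        self.align = CrossmodalAlignment(text_dim, audio_dim, model_dim)
        self.collab = CollaborationAttention(model_dim)
        self.adjust_text = AdjustmentBlock(model_dim)
        self.adjust_audio = AdjustmentBlock(model_dim)
        self.head = nn.Linear(model_dim, num_outputs)  # e.g. a sentiment score

    def forward(self, text, audio):
        text, audio = self.align(text, audio)
        fusion = self.collab(text, audio)
        fusion = self.adjust_text(fusion, text)    # adjust with the text stream
        fusion = self.adjust_audio(fusion, audio)  # adjust with the audio stream
        return self.head(fusion.mean(dim=1))       # pooled prediction


if __name__ == "__main__":
    model = SAFRLMSketch()
    text = torch.randn(2, 50, 300)   # unaligned text features (GloVe-sized, assumed)
    audio = torch.randn(2, 375, 74)  # unaligned audio features (COVAREP-sized, assumed)
    print(model(text, audio).shape)  # torch.Size([2, 1])
```

On CMU-MOSI and CMU-MOSEI the head would regress a sentiment score; the point of the sketch is only that text and audio sequences of different lengths can pass through the three stages without word-level alignment.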
Related papers
- Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition [16.97833694961584]
Foal-Net is designed to enhance the effectiveness of modality fusion.
It includes two auxiliary tasks: audio-video emotion alignment and cross-modal emotion label matching.
Experiments show that Foal-Net outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-08-18T11:05:21Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Multimodal Action Quality Assessment [40.10252351858076]
Action quality assessment (AQA) assesses how well an action is performed.
We argue that although AQA is highly dependent on visual information, audio provides useful complementary information for improving score regression accuracy.
We propose a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models modality-specific information and mixed-modality information.
arXiv Detail & Related papers (2024-01-31T15:37:12Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - Cross-modal Audio-visual Co-learning for Text-independent Speaker
Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z) - Abstractive Sentence Summarization with Guidance of Selective Multimodal
Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generality of the proposed mhsf model under both pre-training+fine-tuning and training-from-scratch strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
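For contrast with the cross-attention-based fusion sketched above, the last entry ("Attention Bottlenecks for Multimodal Fusion") restricts cross-modal exchange to a small set of shared bottleneck tokens. The sketch below illustrates that idea under stated assumptions; the layer sizes, token counts, and the averaging of the bottleneck copies are choices made for illustration, not the paper's configuration.

```python
# Minimal sketch of the 'fusion bottleneck' idea from "Attention Bottlenecks for
# Multimodal Fusion": each modality runs its own transformer layer, and cross-modal
# exchange happens only through a few shared bottleneck tokens. Layer sizes and
# token counts are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4, num_bottlenecks=4):
        super().__init__()
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.num_bottlenecks = num_bottlenecks

    def forward(self, audio_tokens, video_tokens, bottleneck_tokens):
        n = self.num_bottlenecks
        # Each modality attends to its own tokens plus the shared bottleneck tokens.
        audio_out = self.audio_layer(torch.cat([audio_tokens, bottleneck_tokens], dim=1))
        video_out = self.video_layer(torch.cat([video_tokens, bottleneck_tokens], dim=1))
        audio_tokens, audio_btl = audio_out[:, :-n], audio_out[:, -n:]
        video_tokens, video_btl = video_out[:, :-n], video_out[:, -n:]
        # Averaging the two bottleneck copies forces a condensed cross-modal exchange.
        return audio_tokens, video_tokens, 0.5 * (audio_btl + video_btl)


if __name__ == "__main__":
    layer = BottleneckFusionLayer()
    audio = torch.randn(2, 100, 256)
    video = torch.randn(2, 196, 256)
    bottlenecks = torch.randn(2, 4, 256)
    audio, video, bottlenecks = layer(audio, video, bottlenecks)
    print(audio.shape, video.shape, bottlenecks.shape)
```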
This list is automatically generated from the titles and abstracts of the papers on this site.