MIR-GAN: Refining Frame-Level Modality-Invariant Representations with
Adversarial Network for Audio-Visual Speech Recognition
- URL: http://arxiv.org/abs/2306.10567v1
- Date: Sun, 18 Jun 2023 14:02:20 GMT
- Title: MIR-GAN: Refining Frame-Level Modality-Invariant Representations with
Adversarial Network for Audio-Visual Speech Recognition
- Authors: Yuchen Hu, Chen Chen, Ruizhe Li, Heqing Zou, Eng Siong Chng
- Abstract summary: We propose an adversarial network to refine frame-level modality-invariant representations (MIR-GAN).
- Score: 23.042478625584653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual speech recognition (AVSR) has recently attracted a surge of
research interest by leveraging multimodal signals to understand human speech.
Mainstream approaches addressing this task have developed sophisticated
architectures and techniques for multi-modality fusion and representation
learning. However, the natural heterogeneity of different modalities causes a
distribution gap between their representations, making it challenging to fuse
them. In this paper, we aim to learn the shared representations across
modalities to bridge their gap. Different from existing similar methods on
other multimodal tasks like sentiment analysis, we focus on the temporal
contextual dependencies considering the sequence-to-sequence task setting of
AVSR. In particular, we propose an adversarial network to refine frame-level
modality-invariant representations (MIR-GAN), which captures the commonality
across modalities to ease the subsequent multimodal fusion process. Extensive
experiments on the public benchmarks LRS3 and LRS2 show that our approach
outperforms the state of the art.
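The core idea of the abstract can be illustrated with a minimal sketch: a shared encoder maps per-frame audio and visual features into a common space, while a modality discriminator tries to tell which modality each frame came from; training the encoder to fool the discriminator pushes the two distributions together. This is a hypothetical PyTorch illustration of the general adversarial scheme, not the authors' architecture; all module names, dimensions, and losses here are assumptions.

```python
# Hypothetical sketch of adversarial refinement of frame-level
# modality-invariant representations (not the authors' code).
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Maps frame-level features of either modality into a shared space."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):          # x: (batch, frames, dim)
        return self.net(x)

class ModalityDiscriminator(nn.Module):
    """Predicts, per frame, whether a representation came from audio or video."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))
    def forward(self, h):          # returns a per-frame logit: (batch, frames)
        return self.net(h).squeeze(-1)

enc, disc = SharedEncoder(), ModalityDiscriminator()
bce = nn.BCEWithLogitsLoss()
audio = torch.randn(2, 10, 256)    # toy frame-level audio features
video = torch.randn(2, 10, 256)    # toy frame-level visual features

h_a, h_v = enc(audio), enc(video)
# Discriminator step: label audio frames 1, visual frames 0.
d_loss = bce(disc(h_a.detach()), torch.ones(2, 10)) + \
         bce(disc(h_v.detach()), torch.zeros(2, 10))
# Encoder (adversarial) step: flip the labels to fool the discriminator,
# so frames from either modality become indistinguishable in the shared space.
g_loss = bce(disc(h_a), torch.zeros(2, 10)) + bce(disc(h_v), torch.ones(2, 10))
```

In practice the resulting modality-invariant features would be combined with modality-specific ones and fed to the downstream speech recognition decoder; this sketch only shows the adversarial alignment step.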
Related papers
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
- TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis [34.28164104577455]
Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment by leveraging language, visual, and acoustic modalities.
Past research predominantly focused on improving representation learning techniques and feature fusion strategies.
We introduce a Text-oriented Cross-Attention Network (TCAN) emphasizing the predominant role of the text modality in MSA.
arXiv Detail & Related papers (2024-04-06T07:56:09Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
- Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition [63.07844685982738]
This paper presents a new model, the Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations outperform the last-hidden-states of LSTM significantly.
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
arXiv Detail & Related papers (2022-01-17T09:46:59Z)
- A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition [7.80238628278552]
We propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition.
To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset.
The experimental results show that the proposed CFN-SR achieves state-of-the-art performance, obtaining 75.76% accuracy with 26.30M parameters.
arXiv Detail & Related papers (2021-11-03T12:24:03Z)
- Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generality of the proposed mhsf model under both pre-trained+fine-tuning and fresh training strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)
- Cross-Modal Discrete Representation Learning [73.68393416984618]
We present a self-supervised learning framework that learns a representation that captures finer levels of granularity across different modalities.
Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities.
arXiv Detail & Related papers (2021-06-10T00:23:33Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
- MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [48.776247141839875]
We propose a novel framework, MISA, which projects each modality to two distinct subspaces.
The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap.
Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models.
arXiv Detail & Related papers (2020-05-07T15:13:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.