Computation and Parameter Efficient Multi-Modal Fusion Transformer for
Cued Speech Recognition
- URL: http://arxiv.org/abs/2401.17604v2
- Date: Thu, 8 Feb 2024 11:24:54 GMT
- Title: Computation and Parameter Efficient Multi-Modal Fusion Transformer for
Cued Speech Recognition
- Authors: Lei Liu and Li Liu and Haizhou Li
- Abstract summary: Cued Speech (CS) is a pure visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe the visual cues of speech into text.
- Score: 48.84506301960988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cued Speech (CS) is a pure visual coding method used by hearing-impaired
people that combines lip reading with several specific hand shapes to make the
spoken language visible. Automatic CS recognition (ACSR) seeks to transcribe
visual cues of speech into text, which can help hearing-impaired people to
communicate effectively. The visual information of CS comprises lip reading and
hand cueing, so fusing the two plays an important role in ACSR. However, most
previous fusion methods struggle to capture the global dependencies present in
the long-sequence inputs of multi-modal CS data and therefore generally fail to
learn the cross-modal relationships needed for effective fusion. Recently,
attention-based transformers have become a prevalent choice for capturing
global dependencies over long sequences in multi-modal
fusion, but existing multi-modal fusion transformers suffer from both poor
recognition accuracy and inefficient computation for the ACSR task. To address
these problems, we develop a novel computation and parameter efficient
multi-modal fusion transformer by proposing a novel Token-Importance-Aware
Attention mechanism (TIAA), where a token utilization rate (TUR) is formulated
to select the important tokens from the multi-modal streams. More precisely,
TIAA first models the modality-specific fine-grained temporal dependencies
over all tokens of each modality, and then learns the efficient cross-modal
interaction for the modality-shared coarse-grained temporal dependencies over
the important tokens of different modalities. In addition, a lightweight gated
hidden projection is designed to control the feature flows of TIAA. The
resulting model, named Economical Cued Speech Fusion Transformer (EcoCued),
achieves state-of-the-art performance on all existing CS datasets, compared
with existing transformer-based fusion methods and ACSR fusion methods.
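To make the mechanism described above concrete, the following is a minimal, illustrative PyTorch sketch of the TIAA idea. It is not the authors' implementation: the token utilization rate (TUR) is approximated here by the average attention weight each token receives, and the keep ratio, module names, and sigmoid-gated hidden projection are assumptions introduced only for illustration.

```python
import torch
import torch.nn as nn


class TIAASketch(nn.Module):
    """Illustrative sketch of Token-Importance-Aware Attention (not the authors' code)."""

    def __init__(self, dim: int, num_heads: int = 4, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Modality-specific fine-grained attention over all tokens of one modality.
        self.lip_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.hand_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Modality-shared coarse-grained cross-modal attention over important tokens only.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Lightweight gated hidden projection controlling the fused feature flow.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def _select_important(self, attn, x):
        # Fine-grained self-attention; the attention each token receives (averaged
        # over queries) stands in for the token utilization rate (TUR) -- an assumption.
        out, attn_w = attn(x, x, x, need_weights=True)      # attn_w: (B, T, T)
        tur = attn_w.mean(dim=1)                            # (B, T)
        k = max(1, int(self.keep_ratio * x.size(1)))
        idx = tur.topk(k, dim=-1).indices                   # indices of "important" tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))  # (B, k, D)
        return out, torch.gather(out, 1, idx)               # all tokens, selected tokens

    def forward(self, lip, hand):
        lip_all, _ = self._select_important(self.lip_attn, lip)
        _, hand_sel = self._select_important(self.hand_attn, hand)
        # Coarse-grained cross-modal interaction: every lip token attends only to
        # the selected important hand tokens, shrinking the attention cost.
        fused, _ = self.cross_attn(lip_all, hand_sel, hand_sel)
        # Gated hidden projection decides how much cross-modal information flows in.
        return lip_all + self.gate(fused) * self.proj(fused)


lip = torch.randn(2, 100, 256)   # (batch, lip-frame tokens, feature dim)
hand = torch.randn(2, 100, 256)  # (batch, hand-cue tokens, feature dim)
print(TIAASketch(dim=256)(lip, hand).shape)  # torch.Size([2, 100, 256])
```

Under this reading, the fine-grained stage attends over all tokens within each modality, while the coarse-grained cross-modal stage attends only over the selected important tokens, which is where the computation and parameter savings claimed in the abstract would come from.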
Related papers
- CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation [8.874033487493913]
Multimodal emotion recognition in conversation aims to accurately identify emotions in conversational utterances.
We propose a novel Cross-Modality Augmented Transformer with Hierarchical Variational Distillation, called CMATH, which consists of two major components.
Experiments on the IEMOCAP and MELD datasets demonstrate that our proposed model outperforms previous state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-15T09:23:02Z) - Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work has ignored the inter-modal alignment process and intra-modal noise prior to multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z) - MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers [41.54004590821323]
We propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic features.
Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer.
Unlike prior work that aligns only coarse features from the outputs of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grained hierarchical features.
arXiv Detail & Related papers (2024-06-07T13:35:44Z) - Sharing Key Semantics in Transformer Makes Efficient Image Restoration [148.22790334216117]
The self-attention mechanism, a cornerstone of Vision Transformers (ViTs), tends to encompass all global cues, even those from semantically unrelated objects or regions.
In this paper, we propose boosting image restoration performance by sharing key semantics via a Transformer for IR (i.e., SemanIR).
arXiv Detail & Related papers (2024-05-30T12:45:34Z) - Cross-modal Audio-visual Co-learning for Text-independent Speaker
Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvements.
arXiv Detail & Related papers (2023-02-22T10:06:37Z) - A Self-Adjusting Fusion Representation Learning Model for Unaligned
Text-Audio Sequences [16.38826799727453]
How to integrate relevant information of each modality to learn fusion representations has been one of the central challenges in multimodal learning.
In this paper, a Self-Adjusting Fusion Representation Learning Model is proposed to learn robust cross-modal fusion representations directly from unaligned text and audio sequences.
Experimental results show that our model significantly improves performance on all metrics for unaligned text-audio sequences.
arXiv Detail & Related papers (2022-11-12T13:05:28Z) - Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact (a minimal, hypothetical sketch of this substitution idea appears after this list).
arXiv Detail & Related papers (2022-04-19T07:47:50Z) - LMR-CBT: Learning Modality-fused Representations with CB-Transformer for
Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z) - Deep Multimodal Fusion by Channel Exchanging [87.40768169300898]
This paper proposes a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities.
The validity of such an exchange process is also guaranteed by sharing convolutional filters yet keeping separate BN layers across modalities, which, as an added benefit, allows our multimodal architecture to be almost as compact as a unimodal network.
arXiv Detail & Related papers (2020-11-10T09:53:20Z)
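As a contrast with the TIAA sketch above, the following is a hypothetical sketch of the token-substitution idea summarized in the TokenFusion entry: a small learned scorer flags uninformative tokens in each stream and replaces them with a linear projection of the token-aligned features from the other modality. The scorer, threshold, and projection layers are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn


class TokenSubstitutionSketch(nn.Module):
    """Hypothetical sketch of TokenFusion-style token substitution (illustration only)."""

    def __init__(self, dim: int, threshold: float = 0.02):
        super().__init__()
        self.threshold = threshold
        # Per-token "informativeness" scorers -- an assumed form, not the published one.
        self.score_a = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.score_b = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        # Projections mapping tokens of one modality into the other's feature space.
        self.proj_b_to_a = nn.Linear(dim, dim)
        self.proj_a_to_b = nn.Linear(dim, dim)

    def forward(self, a, b):
        # a, b: (batch, tokens, dim); assumes the two streams are token-aligned.
        mask_a = (self.score_a(a) < self.threshold).float()  # 1 where an A token is uninformative
        mask_b = (self.score_b(b) < self.threshold).float()
        # Substitute uninformative tokens with projected features from the other modality.
        a_fused = (1 - mask_a) * a + mask_a * self.proj_b_to_a(b)
        b_fused = (1 - mask_b) * b + mask_b * self.proj_a_to_b(a)
        return a_fused, b_fused


a, b = torch.randn(2, 50, 128), torch.randn(2, 50, 128)
fa, fb = TokenSubstitutionSketch(dim=128)(a, b)
print(fa.shape, fb.shape)  # torch.Size([2, 50, 128]) torch.Size([2, 50, 128])
```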