MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using
Transformers
- URL: http://arxiv.org/abs/2308.03741v1
- Date: Tue, 1 Aug 2023 11:00:25 GMT
- Title: MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using
Transformers
- Authors: Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam and
Naveed Akhtar
- Abstract summary: We propose MAiVAR-T, a novel model that combines the
audio-image and video modalities in an intuitive way for multimodal human
action recognition.
Empirical evaluations on a benchmark action recognition dataset corroborate
the model's strong performance.
- Score: 18.72489078928417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In line with the human capacity to perceive the world by
simultaneously processing and integrating high-dimensional inputs from
multiple modalities such as vision and audio, we propose a novel model,
MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer).
The model combines the audio-image and video modalities in an intuitive way,
with the primary aim of improving the effectiveness of multimodal human
action recognition (MHAR). At the core of MAiVAR-T is the distillation of
informative representations from the audio modality and their transformation
into the image domain. This audio-image representation is then fused with the
video modality to form a unified representation. The approach seeks to
exploit the contextual richness of both the audio and video modalities,
thereby improving action recognition. In contrast to existing
state-of-the-art strategies that rely solely on the audio or video modality,
MAiVAR-T demonstrates superior performance. Our extensive empirical
evaluations on a benchmark action recognition dataset corroborate the model's
strong performance, underscoring the gains available from integrating audio
and video modalities for action recognition.
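The abstract describes a two-branch pipeline: audio is first rendered as an
image-like representation and encoded, then fused with video features by a
transformer to yield a single action representation. Below is a minimal
PyTorch sketch of that general idea; the class name, layer choices, feature
dimensions, and the mean-pooled classification head are illustrative
assumptions, not the paper's actual MAiVAR-T architecture.

```python
import torch
import torch.nn as nn


class MAiVARTSketch(nn.Module):
    """Illustrative audio-image + video fusion model (not the published design)."""

    def __init__(self, num_classes=101, dim=512, video_feat_dim=2048, fusion_layers=2):
        super().__init__()
        # Audio branch: encodes a spectrogram-style "audio image" (B, 3, H, W).
        self.audio_image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, dim),
        )
        # Video branch: assumes per-frame features from a pretrained backbone,
        # projected to the shared fusion width.
        self.video_proj = nn.Linear(video_feat_dim, dim)
        # Transformer encoder fuses the audio token with the video tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=fusion_layers)
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, audio_image, video_feats):
        a = self.audio_image_encoder(audio_image).unsqueeze(1)  # (B, 1, dim)
        v = self.video_proj(video_feats)                        # (B, T, dim)
        fused = self.fusion(torch.cat([a, v], dim=1))           # (B, 1+T, dim)
        return self.cls_head(fused.mean(dim=1))                 # (B, num_classes)


# Dummy usage: a 224x224 audio image and 16 frame features per clip.
model = MAiVARTSketch()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 16, 2048))
print(logits.shape)  # torch.Size([2, 101])
```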
Related papers
- VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords, effectively exploiting key information in images to strengthen the model's speech recognition capability.
arXiv Detail & Related papers (2024-10-01T16:06:02Z) - Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a Mixture-of-Experts for audio-visual ASR to perform robust speech recognition on in-the-wild videos.
Visual information is first encoded into a sequence of visual tokens and mapped into the speech space by a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
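The following snippet sketches that lightweight projection step in isolation: visual tokens are linearly mapped into the speech feature space and concatenated with the speech sequence. The dimensions and token counts are assumptions for illustration, not values from the EVA paper.

```python
import torch
import torch.nn as nn

# Assumed dimensions: 1024-dim visual features, 512-dim speech feature space.
visual_feats = torch.randn(2, 49, 1024)    # (batch, visual tokens, visual dim)
speech_feats = torch.randn(2, 120, 512)    # (batch, speech frames, speech dim)

# Lightweight projection mapping visual tokens into the speech space.
projection = nn.Linear(1024, 512)
visual_tokens = projection(visual_feats)   # (2, 49, 512)

# The projected visual tokens can then be prepended to the speech sequence
# before the combined sequence enters the recognizer.
joint_sequence = torch.cat([visual_tokens, speech_feats], dim=1)  # (2, 169, 512)
print(joint_sequence.shape)
```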
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment [11.063156506583562]
This paper presents a Multi-modal Emotion Recognition (MER) system designed to enhance emotion recognition accuracy in challenging acoustic conditions.
Our approach combines a modified and extended Hierarchical Token-semantic Audio Transformer (HTS-AT) for multi-channel audio processing with an R(2+1)D Convolutional Neural Network (CNN) model for video analysis.
arXiv Detail & Related papers (2024-09-14T21:58:39Z)
- Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio-level semantics from image-audio pairs via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts the input sound into a sound token that, like an ordinary word, can be plugged into existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
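A minimal sketch of the sound-token idea follows, assuming a 512-dim audio encoder output and a CLIP-style 77x768 prompt embedding; the adapter design, dimensions, and insertion position are illustrative guesses, not the AAI implementation.

```python
import torch
import torch.nn as nn

# Assumption: a pretrained audio encoder yields a 512-dim clip-level embedding.
audio_embedding = torch.randn(1, 512)

# Adapter mapping the sound into a single token in the T2I text-embedding space.
sound_adapter = nn.Sequential(nn.Linear(512, 768), nn.Tanh())
sound_token = sound_adapter(audio_embedding).unsqueeze(1)       # (1, 1, 768)

# Placeholder prompt embeddings standing in for a T2I text encoder's output;
# the sound token is spliced in like the embedding of an ordinary word.
prompt_embeddings = torch.randn(1, 77, 768)
conditioned = torch.cat(
    [prompt_embeddings[:, :10], sound_token, prompt_embeddings[:, 10:-1]], dim=1
)
print(conditioned.shape)  # torch.Size([1, 77, 768]) -- same length as the prompt
```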
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- MAiVAR: Multimodal Audio-Image and Video Action Recognizer [18.72489078928417]
We investigate whether the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions into the task.
We propose a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance.
arXiv Detail & Related papers (2022-09-11T03:52:27Z)
- Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
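A toy sketch of what such a label dictionary and lookup might look like; the labels and the value of K below are invented for illustration and are not taken from the paper.

```python
# Invented example labels; the real SAVLD is built from learned label relevance.
K = 2
savld = {
    "playing guitar": ["guitar", "strumming"],
    "chopping wood":  ["axe impact", "wood crack"],
    "riding horse":   ["hooves", "neigh"],
}

def relevant_audio_labels(video_label: str, k: int = K) -> list:
    """Return the k audio labels most relevant to a video label, or an empty
    list when no relevant audio labels are known (audio treated as irrelevant)."""
    return savld.get(video_label, [])[:k]

print(relevant_audio_labels("chopping wood"))  # ['axe impact', 'wood crack']
```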
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.