MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using
Transformers
- URL: http://arxiv.org/abs/2308.03741v1
- Date: Tue, 1 Aug 2023 11:00:25 GMT
- Title: MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using
Transformers
- Authors: Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam and
Naveed Akhtar
- Abstract summary: We propose MAiVAR-T, a novel model that combines the audio-image and video modalities through an intuitive fusion approach.
Our empirical evaluations conducted on a benchmark action recognition dataset corroborate the model's remarkable performance.
- Score: 18.72489078928417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In line with the human capacity to perceive the world by simultaneously
processing and integrating high-dimensional inputs from multiple modalities
like vision and audio, we propose a novel model, MAiVAR-T (Multimodal
Audio-Image to Video Action Recognition Transformer). This model employs an
intuitive approach for the combination of audio-image and video modalities,
with the primary aim of improving the effectiveness of multimodal human action
recognition (MHAR). At the core of MAiVAR-T is the distillation of meaningful
representations from the audio modality and their transformation into the image
domain. This audio-image representation is then fused with the video modality
to form a unified representation. The approach exploits the contextual richness
inherent in both the audio and video modalities to improve action recognition.
In contrast to existing
state-of-the-art strategies that focus solely on audio or video modalities,
state-of-the-art strategies that focus solely on audio or video modalities,
MAiVAR-T demonstrates superior performance. Our extensive empirical evaluations
conducted on a benchmark action recognition dataset corroborate the model's
remarkable performance. This underscores the potential enhancements derived
from integrating audio and video modalities for action recognition purposes.
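The pipeline the abstract describes can be sketched in a few lines: render the audio as a 2-D spectrogram "image", then fuse its pooled features with pooled video features into one joint representation. This is a minimal NumPy illustration of that idea, not the paper's implementation; the STFT parameters, mean-pooling, and concatenation fusion are all assumptions standing in for MAiVAR-T's learned transformer components.

```python
import numpy as np

def audio_to_image(waveform, n_fft=256, hop=128):
    """Convert a 1-D waveform into a 2-D log-magnitude spectrogram 'image'."""
    window = np.hanning(n_fft)
    frames = [
        waveform[i:i + n_fft] * window
        for i in range(0, len(waveform) - n_fft + 1, hop)
    ]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)
    return np.log1p(spec)

def fuse(audio_image, video_frames):
    """Late-fuse pooled audio-image and video features by concatenation."""
    audio_feat = audio_image.mean(axis=1)                     # pool over time
    video_feat = video_frames.reshape(len(video_frames), -1).mean(axis=0)
    return np.concatenate([audio_feat, video_feat])           # unified vector

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)        # 1 s of synthetic 16 kHz audio
video = rng.standard_normal((8, 4, 4))   # 8 tiny synthetic frames
fused = fuse(audio_to_image(wave), video)
print(fused.shape)                       # (145,): 129 freq bins + 16 pixels
```

In MAiVAR-T the pooling and concatenation are replaced by learned transformer encoders and a fusion module, but the data flow — audio rendered as an image, then joined with video into a single representation fed to a classifier — is the same.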
Related papers
- Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition [8.261744063074612]
We propose a Detail-Enhanced Intra- and Inter-modal Interaction network (DE-III) for Audio-Visual Emotion Recognition (AVER).
We introduce optical flow information to enrich video representations with texture details that better capture facial state changes.
A fusion module integrates the optical flow estimation with the corresponding video frames to enhance the representation of facial texture variations.
arXiv Detail & Related papers (2024-05-26T21:31:59Z) - Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual
Masked Autoencoder [3.8735222804007394]
We propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information.
Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content.
arXiv Detail & Related papers (2023-09-15T19:56:15Z) - Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z) - MAViL: Masked Audio-Video Learners [68.61844803682145]
We present Masked Audio-Video learners (MAViL) to train audio-visual representations.
Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks.
For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on benchmarks.
arXiv Detail & Related papers (2022-12-15T18:59:59Z) - MAiVAR: Multimodal Audio-Image and Video Action Recognizer [18.72489078928417]
We investigate if the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions in a task.
We propose a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance.
arXiv Detail & Related papers (2022-09-11T03:52:27Z) - Learnable Irrelevant Modality Dropout for Multimodal Action Recognition
on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its most K-relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z) - AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z) - Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.