MAiVAR: Multimodal Audio-Image and Video Action Recognizer
- URL: http://arxiv.org/abs/2209.04780v1
- Date: Sun, 11 Sep 2022 03:52:27 GMT
- Title: MAiVAR: Multimodal Audio-Image and Video Action Recognizer
- Authors: Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam and
Naveed Akhtar
- Abstract summary: We investigate if the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions in a task.
We propose a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance.
- Score: 18.72489078928417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, action recognition is predominately performed on video data as
processed by CNNs. We investigate if the representation process of CNNs can
also be leveraged for multimodal action recognition by incorporating
image-based audio representations of actions in a task. To this end, we propose
Multimodal Audio-Image and Video Action Recognizer (MAiVAR), a CNN-based
audio-image to video fusion model that accounts for video and audio modalities
to achieve superior action recognition performance. MAiVAR extracts meaningful
image representations of audio and fuses it with video representation to
achieve better performance as compared to both modalities individually on a
large-scale action recognition dataset.
Related papers
- Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual
Masked Autoencoder [3.8735222804007394]
We propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information.
Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content.
arXiv Detail & Related papers (2023-09-15T19:56:15Z) - MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using
Transformers [18.72489078928417]
We propose a novel model for the combination of audio-image and video modalities.
This model employs an intuitive approach for the combination of audio-image and video modalities.
Our empirical evaluations conducted on a benchmark action recognition dataset corroborate the model's remarkable performance.
arXiv Detail & Related papers (2023-08-01T11:00:25Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - AV-data2vec: Self-supervised Learning of Audio-Visual Speech
Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z) - Learnable Irrelevant Modality Dropout for Multimodal Action Recognition
on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its most K-relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z) - AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches just exploit the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z) - Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z) - Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z) - Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state of the art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.