MAiVAR: Multimodal Audio-Image and Video Action Recognizer
- URL: http://arxiv.org/abs/2209.04780v1
- Date: Sun, 11 Sep 2022 03:52:27 GMT
- Title: MAiVAR: Multimodal Audio-Image and Video Action Recognizer
- Authors: Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam and
Naveed Akhtar
- Abstract summary: We investigate whether the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions into the task.
We propose a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance.
- Score: 18.72489078928417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, action recognition is predominantly performed on video data as
processed by CNNs. We investigate whether the representation process of CNNs can
also be leveraged for multimodal action recognition by incorporating
image-based audio representations of actions into the task. To this end, we propose
Multimodal Audio-Image and Video Action Recognizer (MAiVAR), a CNN-based
audio-image to video fusion model that accounts for video and audio modalities
to achieve superior action recognition performance. MAiVAR extracts meaningful
image representations of audio and fuses them with video representations to
achieve better performance than either modality alone on a large-scale action
recognition dataset.
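The abstract does not spell out the implementation, but the fusion idea can be illustrated with a minimal sketch. The PyTorch module below is an illustrative assumption only: the ResNet-18 backbone, feature dimensions, and late-fusion classifier head are placeholders rather than the authors' architecture. It encodes a spectrogram-style audio image with a CNN, concatenates the result with pre-extracted video features, and classifies the action; concatenation is just one plausible fusion choice.
```python
# Minimal sketch of an audio-image / video late-fusion classifier,
# assuming a ResNet-18 audio-image encoder and pre-extracted video features.
# This is NOT the published MAiVAR architecture.
import torch
import torch.nn as nn
import torchvision.models as models


class AudioImageVideoFusion(nn.Module):
    """Hypothetical late-fusion classifier in the spirit of MAiVAR."""

    def __init__(self, num_classes: int, video_feat_dim: int = 2048):
        super().__init__()
        # CNN backbone for the image-based audio representation
        # (e.g. a spectrogram rendered as an RGB image). ResNet-18 is an
        # arbitrary placeholder, not the backbone used in the paper.
        self.audio_cnn = models.resnet18(weights=None)
        audio_feat_dim = self.audio_cnn.fc.in_features  # 512 for ResNet-18
        self.audio_cnn.fc = nn.Identity()  # keep pooled features only
        # Late fusion: concatenate audio-image and video features, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(audio_feat_dim + video_feat_dim, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, audio_image: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        # audio_image: (B, 3, H, W) image rendering of the audio signal
        # video_feat:  (B, video_feat_dim) features from any video backbone
        audio_feat = self.audio_cnn(audio_image)
        fused = torch.cat([audio_feat, video_feat], dim=1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = AudioImageVideoFusion(num_classes=10)  # placeholder class count
    audio_img = torch.randn(2, 3, 224, 224)        # dummy spectrogram images
    video_feat = torch.randn(2, 2048)              # dummy pre-extracted video features
    print(model(audio_img, video_feat).shape)      # torch.Size([2, 10])
```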
Related papers
- VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability.
It effectively utilizes key information in images to enhance recognition performance.
arXiv Detail & Related papers (2024-10-01T16:06:02Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder [3.8735222804007394]
We propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information.
Our approach addresses this challenge by exploiting the complementary nature of audio and video features in cross-modal content.
arXiv Detail & Related papers (2023-09-15T19:56:15Z)
- MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers [18.72489078928417]
We propose a novel model that combines the audio-image and video modalities through an intuitive fusion approach.
Empirical evaluations on a benchmark action recognition dataset corroborate the model's strong performance.
arXiv Detail & Related papers (2023-08-01T11:00:25Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec, which builds audio-visual representations by predicting contextualized target representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of individual features on the task, and an ablation study in which we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)