Related papers: MAiVAR: Multimodal Audio-Image and Video Action Recognizer

MAiVAR: Multimodal Audio-Image and Video Action Recognizer

URL: http://arxiv.org/abs/2209.04780v1
Date: Sun, 11 Sep 2022 03:52:27 GMT
Title: MAiVAR: Multimodal Audio-Image and Video Action Recognizer
Authors: Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam and Naveed Akhtar
Abstract summary: We investigate if the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions in a task. We propose a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance.
Score: 18.72489078928417
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Currently, action recognition is predominately performed on video data as processed by CNNs. We investigate if the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions in a task. To this end, we propose Multimodal Audio-Image and Video Action Recognizer (MAiVAR), a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance. MAiVAR extracts meaningful image representations of audio and fuses it with video representation to achieve better performance as compared to both modalities individually on a large-scale action recognition dataset.

Related papers

Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup [2.80888070977859]
We propose audio-visual SSL for video action recognition, which uses both visual and audio together. In experiments on UCF-51, Kinetics-400, and VGGSound datasets, our model shows the superior performance of the proposed framework.
arXiv Detail & Related papers (2025-03-04T05:13:56Z)
VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability. VHASR can effectively utilize key information in images to enhance the model's speech recognition ability.
arXiv Detail & Related papers (2024-10-01T16:06:02Z)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions. Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs. We employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder [3.8735222804007394]
We propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information. Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content.
arXiv Detail & Related papers (2023-09-15T19:56:15Z)
MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers [18.72489078928417]
We propose a novel model for the combination of audio-image and video modalities. This model employs an intuitive approach for the combination of audio-image and video modalities. Our empirical evaluations conducted on a benchmark action recognition dataset corroborate the model's remarkable performance.
arXiv Detail & Related papers (2023-08-01T11:00:25Z)
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations. Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition. We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its most K-relevant audio labels. We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: efficacy study of single features on the task, and an ablation study where we leave one feature out at a time. For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information. Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. We achieve state of the art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.