AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder
- URL: http://arxiv.org/abs/2309.08738v2
- Date: Wed, 20 Dec 2023 22:20:46 GMT
- Title: AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder
- Authors: Xingjian Diao, Ming Cheng, and Shitong Cheng
- Abstract summary: We propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information.
Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content.
- Score: 3.8735222804007394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning high-quality video representations has significant applications
in computer vision and remains challenging. Previous work based on masked
autoencoders, such as ImageMAE and VideoMAE, has proven the effectiveness of
learning representations in images and videos through a reconstruction strategy
in the visual modality. However, these models exhibit inherent limitations,
particularly in scenarios where extracting features solely from the visual
modality proves challenging, such as when dealing with low-resolution and
blurry original videos. Motivated by this, we propose AV-MaskEnhancer for learning
high-quality video representations by combining visual and audio information.
Our approach addresses the challenge by demonstrating the complementary nature
of audio and video features in cross-modality content. Moreover, our results on
the video classification task on the UCF101 dataset outperform existing
work and reach the state of the art, with a top-1 accuracy of 98.8% and a
top-5 accuracy of 99.9%.
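As a rough illustration of the masked-reconstruction idea described in the abstract, the PyTorch-style sketch below masks a fraction of video patch tokens, encodes the remaining visible video tokens together with audio tokens, and reconstructs the masked patches. All module names, dimensions, and the masking ratio are illustrative assumptions and do not reflect the authors' actual implementation.

```python
# Minimal sketch of audio-visual masked reconstruction (illustrative only).
# Shapes, layer sizes, and the 75% masking ratio are assumptions, not the
# paper's configuration. A full model would use a ViT-style backbone and
# positional embeddings, as in standard MAE designs.
import torch
import torch.nn as nn

class AVMaskedAutoencoderSketch(nn.Module):
    def __init__(self, dim=256, patch_dim=768, audio_dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.video_proj = nn.Linear(patch_dim, dim)   # embed video patches
        self.audio_proj = nn.Linear(audio_dim, dim)   # embed e.g. log-mel frames
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.reconstruct = nn.Linear(dim, patch_dim)  # predict masked patch content

    def forward(self, video_patches, audio_frames):
        # video_patches: (B, Nv, patch_dim); audio_frames: (B, Na, audio_dim)
        B, Nv, _ = video_patches.shape
        num_keep = int(Nv * (1 - self.mask_ratio))

        # Randomly keep a subset of video tokens; the rest are masked out.
        perm = torch.rand(B, Nv, device=video_patches.device).argsort(dim=1)
        keep_idx, masked_idx = perm[:, :num_keep], perm[:, num_keep:]
        visible = torch.gather(
            self.video_proj(video_patches), 1,
            keep_idx.unsqueeze(-1).expand(-1, -1, self.video_proj.out_features),
        )

        # Encode visible video tokens jointly with the (unmasked) audio tokens,
        # letting audio supply cues that the masked visual regions lack.
        audio = self.audio_proj(audio_frames)
        encoded = self.encoder(torch.cat([visible, audio], dim=1))

        # Decode: append learnable mask tokens for the hidden patches and
        # reconstruct their original content.
        masks = self.mask_token.expand(B, masked_idx.shape[1], -1)
        decoded = self.decoder(torch.cat([encoded, masks], dim=1))
        pred = self.reconstruct(decoded[:, -masked_idx.shape[1]:])

        # Reconstruction loss on the masked video patches only.
        target = torch.gather(
            video_patches, 1,
            masked_idx.unsqueeze(-1).expand(-1, -1, video_patches.shape[-1]),
        )
        return nn.functional.mse_loss(pred, target)
```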
Related papers
- Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z)
- ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition [4.36572039512405]
We present the first pose-augmented Vision-Language Model (VLM) for Video Action Recognition.
Notably, our scheme achieves an accuracy of 92.81% and 73.02% on two popular human video action recognition benchmark datasets.
arXiv Detail & Related papers (2023-08-07T20:50:54Z)
- ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders [11.727612242016871]
ViC-MAE is a model that combines Masked AutoEncoders (MAE) and contrastive learning.
We show that visual representations learned under ViC-MAE generalize well to both video and image classification tasks.
arXiv Detail & Related papers (2023-03-21T16:33:40Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- MAiVAR: Multimodal Audio-Image and Video Action Recognizer [18.72489078928417]
We investigate whether the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions into the task.
We propose a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance.
arXiv Detail & Related papers (2022-09-11T03:52:27Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as a method for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations.
arXiv Detail & Related papers (2020-07-29T16:19:50Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.