Related papers: Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup

URL: http://arxiv.org/abs/2503.02284v1
Date: Tue, 04 Mar 2025 05:13:56 GMT
Title: Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup
Authors: Seokun Kang, Taehwan Kim
Abstract summary: We propose audio-visual SSL for video action recognition, which uses both visual and audio together. In experiments on UCF-51, Kinetics-400, and VGGSound datasets, our model shows the superior performance of the proposed framework.
Score: 2.80888070977859
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video action recognition is a challenging but important task for understanding and discovering what the video does. However, acquiring annotations for a video is costly, and semi-supervised learning (SSL) has been studied to improve performance even with a small number of labeled data in the task. Prior studies for semi-supervised video action recognition have mostly focused on using single modality - visuals - but the video is multi-modal, so utilizing both visuals and audio would be desirable and improve performance further, which has not been explored well. Therefore, we propose audio-visual SSL for video action recognition, which uses both visual and audio together, even with quite a few labeled data, which is challenging. In addition, to maximize the information of audio and video, we propose a novel audio source localization-guided mixup method that considers inter-modal relations between video and audio modalities. In experiments on UCF-51, Kinetics-400, and VGGSound datasets, our model shows the superior performance of the proposed semi-supervised audio-visual action recognition framework and audio source localization-guided mixup.

Related papers

Audio-visual training for improved grounding in video-text LLMs [1.9320359360360702]
We propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset.
arXiv Detail & Related papers (2024-07-21T03:59:14Z)
Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges. This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations. We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup. We introduce a unified audio-visual few-shot video classification benchmark on three datasets. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning. We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
MAViL: Masked Audio-Video Learners [68.61844803682145]
We present Masked Audio-Video learners (MAViL) to train audio-visual representations. Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks. For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on benchmarks.
arXiv Detail & Related papers (2022-12-15T18:59:59Z)
Role of Audio in Audio-Visual Video Summarization [8.785359786012302]
We propose a new audio-visual video summarization framework integrating four ways of audio-visual information fusion with GRU-based and attention-based networks. Experimental evaluations on the TVSum dataset attain F1 score and Kendall-tau score improvements for the audio-visual video summarization.
arXiv Detail & Related papers (2022-12-02T09:11:49Z)
Self-supervised Contrastive Learning for Audio-Visual Action Recognition [7.188231323934023]
The underlying correlation between audio and visual modalities can be utilized to learn supervised information for unlabeled videos. We propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (AVCL) to learn discriminative audio-visual representations for action recognition.
arXiv Detail & Related papers (2022-04-28T10:01:36Z)
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition. We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its most K-relevant audio labels. We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches just exploit the visual information while neglecting the audio information. We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio). Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.