Localizing Visual Sounds the Hard Way
- URL: http://arxiv.org/abs/2104.02691v1
- Date: Tue, 6 Apr 2021 17:38:18 GMT
- Title: Localizing Visual Sounds the Hard Way
- Authors: Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea
Vedaldi, Andrew Zisserman
- Abstract summary: We train the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound.
We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset.
We introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset.
- Score: 149.84890978170174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The objective of this work is to localize sound sources that are visible in a
video without using manual annotations. Our key technical contribution is to
show that, by training the network to explicitly discriminate challenging image
fragments, even for images that do contain the object emitting the sound, we
can significantly boost the localization performance. We do so elegantly by
introducing a mechanism to mine hard samples and add them to a contrastive
learning formulation automatically. We show that our algorithm achieves
state-of-the-art performance on the popular Flickr SoundNet dataset.
Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of
annotations for the recently-introduced VGG-Sound dataset, where the sound
sources visible in each video clip are explicitly marked with bounding box
annotations. This dataset is 20 times larger than analogous existing ones,
contains 5K videos spanning over 200 categories, and, differently from Flickr
SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves
state-of-the-art performance against several baselines.
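To make the key idea concrete, below is a minimal sketch (in PyTorch, not the authors' released implementation) of a contrastive localization objective with automatically mined hard samples: an audio-visual similarity map is softly split into a confident positive region and a low-similarity region of the same image, and the latter is treated as a hard negative alongside mismatched audio-image pairs. The threshold values, the sigmoid-based soft masks, and the temperature are illustrative assumptions.

```python
# Minimal sketch of contrastive sound-source localization with automatically
# mined hard negatives, in the spirit of the paper. Thresholds, soft masks and
# the temperature are illustrative assumptions, not the authors' released code.
import torch
import torch.nn.functional as F

def hard_mined_contrastive_loss(vis_feats, aud_embs,
                                pos_thresh=0.65, neg_thresh=0.4,
                                sharpness=0.03, temperature=0.07):
    """
    vis_feats: (B, C, H, W) L2-normalised spatial visual features
    aud_embs:  (B, C)       L2-normalised audio embeddings
    """
    B = vis_feats.shape[0]
    # Cosine-similarity maps between every audio clip and every image:
    # sims[i, j] compares audio i with image j, shape (B, B, H, W).
    sims = torch.einsum('jchw,ic->ijhw', vis_feats, aud_embs)

    # Paired maps (audio i vs. its own image i): (B, H, W).
    paired = sims[torch.arange(B), torch.arange(B)]

    # Soft masks: high-similarity regions act as positives, while low-similarity
    # regions of the *same* image are mined as hard negatives.
    pos_mask = torch.sigmoid((paired - pos_thresh) / sharpness)
    hard_neg_mask = torch.sigmoid((neg_thresh - paired) / sharpness)

    pos_score = (pos_mask * paired).flatten(1).sum(-1) / (pos_mask.flatten(1).sum(-1) + 1e-8)
    hard_neg_score = (hard_neg_mask * paired).flatten(1).sum(-1) / (hard_neg_mask.flatten(1).sum(-1) + 1e-8)

    # Easy negatives: mismatched audio-image pairs, pooled over space.
    off_diag = ~torch.eye(B, dtype=torch.bool, device=sims.device)
    easy_neg_score = sims.mean(dim=(2, 3))[off_diag].view(B, B - 1)

    # InfoNCE-style objective: the paired positive competes against the mined
    # hard negative of its own image and all mismatched pairs.
    logits = torch.cat([pos_score[:, None], hard_neg_score[:, None], easy_neg_score], dim=1)
    labels = torch.zeros(B, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits / temperature, labels)

# Example call with random (hypothetical) features.
vis = F.normalize(torch.randn(4, 512, 14, 14), dim=1)
aud = F.normalize(torch.randn(4, 512), dim=1)
loss = hard_mined_contrastive_loss(vis, aud)
```

At inference, the paired audio-visual similarity map (upsampled to the image resolution) can be read off directly as a localization heatmap, which is what benchmarks such as Flickr SoundNet and VGG-SS score against the annotated regions.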
Related papers
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Audio-Visual Glance Network for Efficient Video Recognition [17.95844876568496]
We propose the Audio-Visual Glance Network (AVGN) to efficiently process the spatio-temporally important parts of a video.
We use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates a saliency score for each frame.
We incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of our AVGN (a generic sketch of this kind of saliency-based frame selection follows the related-papers list below).
arXiv Detail & Related papers (2023-08-18T05:46:20Z)
- Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF).
LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations.
The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z)
- Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z)
- Sound-Guided Semantic Video Generation [15.225598817462478]
We propose a framework to generate realistic videos by leveraging multimodal (sound-image-text) embedding space.
As sound provides the temporal contexts of the scene, our framework learns to generate a video that is semantically consistent with sound.
arXiv Detail & Related papers (2022-04-20T07:33:10Z)
- Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)
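The Audio-Visual Glance Network entry above mentions scoring per-frame saliency from audio and visual cues. The sketch below is a generic illustration of that idea under assumed feature shapes and module sizes, not the AV-TeST architecture itself: a small transformer scores each frame from concatenated audio and visual features, and only the top-scoring frames would be routed to a heavier recognition backbone.

```python
# Generic sketch of frame selection by audio-visual saliency. All module sizes,
# names and the top-k policy are assumptions for illustration only.
import torch
import torch.nn as nn

class FrameSaliencyScorer(nn.Module):
    def __init__(self, aud_dim=128, vis_dim=256, d_model=256, n_frames_kept=8):
        super().__init__()
        self.proj = nn.Linear(aud_dim + vis_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.score_head = nn.Linear(d_model, 1)
        self.n_frames_kept = n_frames_kept

    def forward(self, aud_feats, vis_feats):
        """
        aud_feats: (B, T, aud_dim) lightweight per-frame audio features
        vis_feats: (B, T, vis_dim) lightweight per-frame visual features
        returns the indices of the k most salient frames and the per-frame scores
        """
        x = self.proj(torch.cat([aud_feats, vis_feats], dim=-1))  # (B, T, d_model)
        x = self.encoder(x)                                       # temporal context
        scores = self.score_head(x).squeeze(-1)                   # (B, T)
        k = min(self.n_frames_kept, scores.shape[1])
        topk = scores.topk(k, dim=1).indices                      # frames to process fully
        return topk, scores

# Usage: only the selected frames would be passed to an expensive backbone.
scorer = FrameSaliencyScorer()
aud = torch.randn(2, 32, 128)
vis = torch.randn(2, 32, 256)
keep_idx, frame_scores = scorer(aud, vis)
```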
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.