Discriminative Sounding Objects Localization via Self-supervised
Audiovisual Matching
- URL: http://arxiv.org/abs/2010.05466v1
- Date: Mon, 12 Oct 2020 05:51:55 GMT
- Title: Discriminative Sounding Objects Localization via Self-supervised
Audiovisual Matching
- Authors: Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding,
Weiyao Lin and Dejing Dou
- Abstract summary: We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
- Score: 87.42246194790467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Discriminatively localizing sounding objects in cocktail-party scenarios, i.e.,
mixed sound scenes, is commonplace for humans but still challenging for machines. In
this paper, we propose a two-stage learning framework to perform
self-supervised class-aware sounding object localization. First, we propose to
learn robust object representations by aggregating the candidate sound
localization results in the single source scenes. Then, class-aware object
localization maps are generated in the cocktail-party scenarios by referring to
the pre-learned object knowledge, and the sounding objects are accordingly
selected by matching audio and visual object category distributions, where the
audiovisual consistency is viewed as the self-supervised signal. Experimental
results in both realistic and synthesized cocktail-party videos demonstrate
that our model is superior in filtering out silent objects and pointing out the
location of sounding objects of different classes. Code is available at
https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.
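As an illustration only, below is a minimal, hypothetical PyTorch sketch of the second stage described above: frame features are correlated with a pre-learned object dictionary from stage one to produce class-aware localization maps, and the audio and visual category distributions are then matched, with their consistency serving as the self-supervised signal. All function names, tensor shapes, and the choice of KL divergence are assumptions for illustration; the repository linked above contains the authors' actual implementation.

```python
# Hedged sketch, not the authors' code: illustrates class-aware map generation
# and audiovisual category-distribution matching as described in the abstract.
import torch
import torch.nn.functional as F

def class_aware_maps(visual_feat, object_dict):
    """visual_feat: (B, C, H, W) frame features; object_dict: (K, C) object
    representations aggregated in stage one from single-source scenes."""
    B, C, H, W = visual_feat.shape
    feat = F.normalize(visual_feat, dim=1).flatten(2)        # (B, C, H*W)
    keys = F.normalize(object_dict, dim=1)                   # (K, C)
    maps = torch.einsum('kc,bcn->bkn', keys, feat)           # cosine similarities
    return maps.view(B, -1, H, W)                            # (B, K, H, W)

def audiovisual_matching_loss(loc_maps, audio_logits):
    """loc_maps: (B, K, H, W) class-aware localization maps;
    audio_logits: (B, K) category scores predicted from the mixed audio."""
    visual_dist = F.softmax(loc_maps.amax(dim=(2, 3)), dim=1)  # visual category distribution
    audio_dist = F.softmax(audio_logits, dim=1)                 # audio category distribution
    # Audiovisual consistency (here, KL divergence) is the self-supervised signal.
    return F.kl_div(audio_dist.log(), visual_dist, reduction='batchmean')
```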
Related papers
- Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics [26.473529162341837]
We present an audio-visual instance-aware segmentation approach to overcome the dataset bias.
Our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio.
Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
arXiv Detail & Related papers (2023-07-31T12:56:30Z)
- LISA: Localized Image Stylization with Audio via Implicit Neural Representation [17.672008998994816]
We present a novel framework, Localized Image Stylization with Audio (LISA).
LISA performs audio-driven localized image stylization.
We show that the proposed framework outperforms the other audio-guided stylization methods.
arXiv Detail & Related papers (2022-11-21T11:51:48Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Move2Hear: Active Audio-Visual Source Separation [90.16327303008224]
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
arXiv Detail & Related papers (2021-05-15T04:58:08Z)
- Contrastive Learning of Global and Local Audio-Visual Representations [25.557229705149577]
We propose a versatile self-supervised approach to learn audio-visual representations that generalizes to tasks that require global semantic information.
We show that our approach learns generalizable video representations across various downstream scenarios, including action/sound classification, lip reading, deepfake detection, and sound source localization (a generic sketch of such a cross-modal contrastive objective is given after this list).
arXiv Detail & Related papers (2021-04-07T07:35:08Z)
- Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach, where the network learns both what individual objects look and sound like.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
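As a closing illustration for the contrastive-learning entry above, the following is a minimal, generic sketch of a cross-modal InfoNCE-style objective, in which paired audio and visual embeddings from the same clip are treated as positives and all other pairs in the batch as negatives. The encoder outputs, temperature, and symmetric formulation are assumptions for illustration and do not reproduce any listed paper's actual implementation.

```python
# Hypothetical cross-modal contrastive (InfoNCE-style) loss sketch.
import torch
import torch.nn.functional as F

def cross_modal_infonce(audio_emb, visual_emb, temperature=0.07):
    """audio_emb, visual_emb: (B, D) projected embeddings of the same B clips."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(visual_emb, dim=1)
    logits = a @ v.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matching index is the positive
    # Symmetric loss: audio-to-visual and visual-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```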