Class-aware Sounding Objects Localization via Audiovisual Correspondence
- URL: http://arxiv.org/abs/2112.11749v1
- Date: Wed, 22 Dec 2021 09:34:33 GMT
- Title: Class-aware Sounding Objects Localization via Audiovisual Correspondence
- Authors: Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song and Ji-Rong Wen
- Abstract summary: We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
- Score: 51.39872698365446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audiovisual scenes are pervasive in our daily life. It is commonplace for
humans to discriminatively localize different sounding objects but quite
challenging for machines to achieve class-aware sounding objects localization
without category annotations, i.e., localizing the sounding object and
recognizing its category. To address this problem, we propose a two-stage
step-by-step learning framework to localize and recognize sounding objects in
complex audiovisual scenarios using only the correspondence between audio and
vision. First, we propose to determine the sounding area via coarse-grained
audiovisual correspondence in the single source cases. Then visual features in
the sounding area are leveraged as candidate object representations to
establish a category-representation object dictionary for expressive visual
character extraction. We generate class-aware object localization maps in
cocktail-party scenarios and use audiovisual correspondence to suppress silent
areas by referring to this dictionary. Finally, we employ category-level
audiovisual consistency as the supervision to achieve fine-grained audio and
sounding object distribution alignment. Experiments on both realistic and
synthesized videos show that our model is superior in localizing and
recognizing objects as well as filtering out silent ones. We also transfer the
learned audiovisual network into the unsupervised object detection task,
obtaining reasonable performance.
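As a rough illustration of the dictionary-based second stage described in the abstract, the sketch below shows, in PyTorch-style Python, how class-aware localization maps could be formed by comparing per-pixel visual features against a category-representation object dictionary and then gated by an audiovisual correspondence map to suppress silent areas. All tensor names, shapes, and the specific similarity/gating choices are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch (assumed shapes and operations, not the paper's implementation):
# class-aware localization maps from a category-representation dictionary,
# suppressed in silent regions by an audiovisual correspondence map.
import torch
import torch.nn.functional as F

def class_aware_localization(visual_feat, audio_emb, obj_dict):
    """
    visual_feat: (B, C, H, W) spatial visual features from a vision backbone
    audio_emb:   (B, C) global audio embedding for the same clip
    obj_dict:    (K, C) category-representation object dictionary
                 (one learned prototype per object category)
    Returns class-aware localization maps of shape (B, K, H, W).
    """
    feat = F.normalize(visual_feat, dim=1)                    # unit-norm per pixel
    protos = F.normalize(obj_dict, dim=1)                     # unit-norm prototypes

    # Per-class localization: cosine similarity between each pixel and each prototype.
    class_maps = torch.einsum('bchw,kc->bkhw', feat, protos)  # (B, K, H, W)

    # Audiovisual correspondence: similarity between the audio embedding and pixels.
    audio = F.normalize(audio_emb, dim=1)                     # (B, C)
    av_corr = torch.einsum('bchw,bc->bhw', feat, audio)       # (B, H, W)
    av_corr = torch.relu(av_corr).unsqueeze(1)                # keep sounding areas only

    # Suppress silent areas: class maps survive only where audio and vision agree.
    return class_maps * av_corr

# Example usage with dummy tensors (dimensions are placeholders).
maps = class_aware_localization(
    visual_feat=torch.randn(2, 512, 14, 14),
    audio_emb=torch.randn(2, 512),
    obj_dict=torch.randn(10, 512),
)
print(maps.shape)  # torch.Size([2, 10, 14, 14])
```

The final stage's category-level audiovisual consistency could then be imposed by pooling these maps into a per-class sounding-object distribution and aligning it with the predicted audio event distribution; that supervision is omitted here since the abstract does not specify its exact form.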
Related papers
- Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics [26.473529162341837]
We present an audio-visual instance-aware segmentation approach to overcome the dataset bias.
Our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio.
Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
arXiv Detail & Related papers (2023-07-31T12:56:30Z) - Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z) - Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sounds and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z) - Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z) - Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose audio-visual co-segmentation, where the network learns both what individual objects look like and how they sound.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z) - Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching [87.42246194790467]
We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
arXiv Detail & Related papers (2020-10-12T05:51:55Z) - Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)