The Sound of Bounding-Boxes
- URL: http://arxiv.org/abs/2203.15991v1
- Date: Wed, 30 Mar 2022 01:58:52 GMT
- Title: The Sound of Bounding-Boxes
- Authors: Takashi Oya, Shohei Iwase, Shigeo Morishima
- Abstract summary: We propose a fully unsupervised method that learns to detect objects in an image and separate sound sources simultaneously.
Despite being fully unsupervised, our method achieves comparable separation accuracy.
- Score: 12.019518891110007
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the task of audio-visual sound source separation, which leverages visual
information for sound source separation, identifying objects in an image is a
crucial step prior to separating the sound source. However, existing methods
that assign sounds to detected bounding boxes suffer from heavy reliance on
pre-trained object detectors. Specifically, these methods require
predetermining all possible categories of objects that can produce sound and
using an object detector applicable to all such categories. To tackle this
problem, we propose a fully unsupervised method that learns to detect objects
in an image and separate sound sources simultaneously. As our method does not
rely on any pre-trained detector, it is applicable to arbitrary categories
without any additional annotation. Furthermore, despite being fully
unsupervised, we found that our method performs comparably in separation
accuracy.
Related papers
- Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge [14.801564966406486]
The goal of the multi-sound source localization task is to localize sound sources from the mixture individually.
We present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources.
arXiv Detail & Related papers (2024-03-26T06:27:50Z)
- Universal Noise Annotation: Unveiling the Impact of Noisy annotation on Object Detection [36.318411642128446]
We propose Universal Noise Annotation (UNA), a more practical setting that encompasses all types of noise that can occur in object detection.
We analyze the development of previous detection algorithms and examine the factors that affect the robustness of detection model training.
We open-source the code for injecting UNA into a dataset, and all training logs and weights are also shared.
arXiv Detail & Related papers (2023-12-21T13:12:37Z)
- Integrating Audio-Visual Features for Multimodal Deepfake Detection [33.51027054306748]
Deepfakes are AI-generated media in which an image or video has been digitally modified.
This paper proposes an audio-visual-based method for deepfake detection, which integrates fine-grained deepfake identification with binary classification.
arXiv Detail & Related papers (2023-10-05T18:19:56Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z)
- Self-supervised object detection from audio-visual correspondence [101.46794879729453]
We tackle the problem of learning object detectors without supervision.
We do not assume image-level class labels, instead we extract a supervisory signal from audio-visual data.
We show that our method can learn to detect generic objects that go beyond instruments, such as airplanes and cats.
arXiv Detail & Related papers (2021-04-13T17:59:03Z)
- Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach in which the network learns both what individual objects look like and how they sound.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching [69.24460241328521]
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions.
We propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to the ones of the source-domain training dataset.
We show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.
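The alignment described above can be sketched as follows. This is a minimal illustration of per-band first- and second-order statistics matching, not the paper's implementation; the feature shapes, the function name, and the use of log-mel frames are assumptions for the example.

```python
import numpy as np

def align_band_statistics(target_feats, source_mean, source_std, eps=1e-8):
    """Align each frequency band's mean and standard deviation in the
    target-domain features to the source-domain statistics.

    target_feats: (n_frames, n_bands) array, e.g. log-mel spectrogram frames
    source_mean, source_std: per-band statistics precomputed on the
        source-domain training set, each of shape (n_bands,)
    """
    t_mean = target_feats.mean(axis=0)
    t_std = target_feats.std(axis=0)
    # Standardize each band of the target data, then rescale it to the
    # source-domain statistics (first and second order).
    return (target_feats - t_mean) / (t_std + eps) * source_std + source_mean
```

After this transform, a classifier trained on source-domain features sees target-domain inputs whose per-band statistics match the training distribution, without needing any target-domain labels.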
arXiv Detail & Related papers (2020-04-30T23:56:05Z)
- Towards Noise-resistant Object Detection with Noisy Annotations [119.63458519946691]
Training deep object detectors requires a significant amount of human-annotated images with accurate object labels and bounding box coordinates.
Noisy annotations are much more easily accessible, but they could be detrimental for learning.
We address the challenging problem of training object detectors with noisy annotations, where the noise contains a mixture of label noise and bounding box noise.
arXiv Detail & Related papers (2020-03-03T01:32:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.