Visually Guided Sound Source Separation and Localization using
Self-Supervised Motion Representations
- URL: http://arxiv.org/abs/2104.08506v1
- Date: Sat, 17 Apr 2021 10:09:15 GMT
- Title: Visually Guided Sound Source Separation and Localization using
Self-Supervised Motion Representations
- Authors: Lingyu Zhu and Esa Rahtu
- Abstract summary: We aim to pinpoint the source location in the input video sequence.
Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type.
We propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise to appearance and motion cues.
- Score: 16.447597767676655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this paper is to perform audio-visual sound source
separation, i.e., to separate component audios from a mixture based on the
videos of sound sources. Moreover, we aim to pinpoint the source location in
the input video sequence. Recent works have shown impressive audio-visual
separation results when using prior knowledge of the source type (e.g. human
playing instrument) and pre-trained motion detectors (e.g. keypoints or optical
flows). However, at the same time, the models are limited to a certain
application domain. In this paper, we address these limitations and make the
following contributions: i) we propose a two-stage architecture, called
Appearance and Motion network (AMnet), where the stages specialise to
appearance and motion cues, respectively. The entire system is trained in a
self-supervised manner; ii) we introduce an Audio-Motion Embedding (AME)
framework to explicitly represent the motions that are related to sound; iii) we
propose an audio-motion transformer architecture for audio and motion feature
fusion; iv) we demonstrate state-of-the-art performance on two challenging
datasets (MUSIC-21 and AVE) despite the fact that we do not use any pre-trained
keypoint detectors or optical flow estimators. Project page:
https://ly-zhu.github.io/self-supervised-motion-representations
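For a concrete picture of contribution (iii), the snippet below is a minimal PyTorch-style sketch of how audio features could attend to self-supervised motion features through a transformer and produce a separation mask. It is an illustration under assumed shapes, layer sizes, and module names, not the AMnet implementation; refer to the project page above for the authors' code.
```python
# Minimal sketch (not the authors' code) of cross-attention fusion between
# audio and motion features, in the spirit of the audio-motion transformer
# described in the abstract. All shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn


class AudioMotionFusion(nn.Module):
    """Audio tokens attend to motion tokens; the fused features are mapped
    to a time-frequency separation mask."""

    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.mask_head = nn.Linear(dim, dim)  # per-token mask logits (assumed head)

    def forward(self, audio_tokens: torch.Tensor, motion_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens:  (B, T_a, dim), e.g. flattened spectrogram features
        # motion_tokens: (B, T_m, dim), e.g. features from a self-supervised motion encoder
        fused = self.decoder(tgt=audio_tokens, memory=motion_tokens)
        return torch.sigmoid(self.mask_head(fused))


if __name__ == "__main__":
    fusion = AudioMotionFusion()
    audio = torch.randn(2, 128, 256)    # stand-in for audio encoder output
    motion = torch.randn(2, 32, 256)    # stand-in for motion encoder output
    print(fusion(audio, motion).shape)  # torch.Size([2, 128, 256])
```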
Related papers
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to focus video-to-audio generation faithfully on the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization [11.059590443280726]
Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research.
In a video, oftentimes, the objects exhibiting movement are the ones generating the sound.
In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source.
arXiv Detail & Related papers (2022-11-06T03:48:45Z)
- Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation [36.38300120482868]
We present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation.
ASMP achieves a clear improvement in source separation quality, outperforming prior works on two challenging audio-visual datasets.
arXiv Detail & Related papers (2022-10-29T02:55:39Z)
- VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer [4.167459103689587]
This paper presents an audio-visual approach for voice separation.
It outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice.
arXiv Detail & Related papers (2022-03-08T14:08:47Z)
- Active Audio-Visual Separation of Dynamic Sound Sources [93.97385339354318]
We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone.
We show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target.
arXiv Detail & Related papers (2022-02-02T02:03:28Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Move2Hear: Active Audio-Visual Source Separation [90.16327303008224]
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
arXiv Detail & Related papers (2021-05-15T04:58:08Z)
- Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach, where the network learns both what individual objects look and sound like.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)