STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes
with Spatiotemporal Annotations of Sound Events
- URL: http://arxiv.org/abs/2306.09126v2
- Date: Tue, 14 Nov 2023 08:29:23 GMT
- Title: STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes
with Spatiotemporal Annotations of Sound Events
- Authors: Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel
Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya
Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji
- Abstract summary: Sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker.
This paper proposes an audio-visual sound event localization and detection (SELD) task.
Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence.
- Score: 30.459545240265246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While direction of arrival (DOA) of sound events is generally estimated from
multichannel audio data recorded in a microphone array, sound events usually
derive from visually perceptible source objects, e.g., sounds of footsteps come
from the feet of a walker. This paper proposes an audio-visual sound event
localization and detection (SELD) task, which uses multichannel audio and video
information to estimate the temporal activation and DOA of target sound events.
Audio-visual SELD systems can detect and localize sound events using signals
from a microphone array and audio-visual correspondence. We also introduce an
audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23),
which consists of multichannel audio data recorded with a microphone array,
video data, and spatiotemporal annotation of sound events. Sound scenes in
STARSS23 are recorded with instructions, which guide recording participants to
ensure adequate activity and occurrences of sound events. STARSS23 also provides
human-annotated temporal activation labels and human-confirmed DOA labels,
which are based on tracking results of a motion capture system. Our benchmark
results demonstrate the benefits of using visual object positions in
audio-visual SELD tasks. The data is available at
https://zenodo.org/record/7880637.
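The abstract describes frame-wise temporal activation and DOA labels for the target sound event classes. The Python sketch below shows one plausible way to load a clip and convert its annotations into frame-wise SELD targets; the file naming, the 13-class count, and the CSV column layout (frame index at 100 ms resolution, class index, source index, azimuth and elevation in degrees) follow the DCASE SELD convention and are assumptions to verify against the dataset documentation on Zenodo.

# Minimal sketch: load one multichannel recording and its annotation CSV,
# then build frame-wise SELD targets (class activity + Cartesian DOA vectors).
# File names, the 13-class count, and the CSV column layout are assumptions
# based on the DCASE SELD convention; check the STARSS23 README before use.
import numpy as np
import pandas as pd
import soundfile as sf

N_CLASSES = 13      # assumed number of target sound event classes
FRAME_SEC = 0.1     # assumed 100 ms annotation resolution

def load_clip(wav_path, csv_path):
    audio, sr = sf.read(wav_path)              # (samples, channels), e.g. 4-ch FOA
    meta = pd.read_csv(csv_path, header=None)  # assumed: frame, class, source, azi, ele, ...

    n_frames = int(np.ceil(audio.shape[0] / sr / FRAME_SEC))
    activity = np.zeros((n_frames, N_CLASSES), dtype=np.float32)
    doa = np.zeros((n_frames, N_CLASSES, 3), dtype=np.float32)

    for _, row in meta.iterrows():
        frame, cls = int(row[0]), int(row[1])
        if frame >= n_frames:
            continue
        azi, ele = np.deg2rad(float(row[3])), np.deg2rad(float(row[4]))
        activity[frame, cls] = 1.0
        # Unit vector pointing toward the source (azimuth/elevation to Cartesian).
        doa[frame, cls] = [np.cos(ele) * np.cos(azi),
                           np.cos(ele) * np.sin(azi),
                           np.sin(ele)]
    return audio, sr, activity, doa

Representing DOA as Cartesian unit vectors mirrors common DCASE SELD baselines and keeps regression losses against the human-confirmed DOA labels straightforward.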
Related papers
- DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection [16.92604848450722]
This paper describes sound event localization and detection (SELD) for spatial audio recordings captured by first-order ambisonics (FOA) microphones.
We propose a novel method of pretraining the feature extraction part of the deep neural network (DNN) in a self-supervised manner.
arXiv Detail & Related papers (2024-10-30T08:31:58Z)
- Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes [0.0]
We build on the audio-only SELDnet23 model and adapt it to be audio-visual by merging both audio and video information (a generic fusion sketch in this spirit appears after the list of related papers).
We also build a framework that implements audio-visual data augmentation and audio-visual synthetic data generation.
arXiv Detail & Related papers (2024-01-29T06:05:23Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events.
Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z)
- A dataset for Audio-Visual Sound Event Detection in Movies [33.59510253345295]
We present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S).
We use publicly-available closed-caption transcripts to automatically mine over 110K audio events from 430 movies.
We identify three dimensions for categorizing audio events (sound, source, and quality) and present the steps involved in producing a final taxonomy of 245 sounds.
arXiv Detail & Related papers (2023-02-14T19:55:39Z)
- Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing temporal extents and class labels.
We identify actions that can be discriminated purely from audio by grouping free-form descriptions of audio into classes.
Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z)
- Active Audio-Visual Separation of Dynamic Sound Sources [93.97385339354318]
We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone.
We show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target.
arXiv Detail & Related papers (2022-02-02T02:03:28Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Multi-label Sound Event Retrieval Using a Deep Learning-based Siamese Structure with a Pairwise Presence Matrix [11.54047475139282]
State-of-the-art sound event retrieval models have focused on single-label audio recordings.
We propose different Deep Learning architectures with a Siamese-structure and a Pairwise Presence Matrix.
The networks are trained and evaluated using the SONYC-UST dataset containing both single- and multi-label soundscape recordings.
arXiv Detail & Related papers (2020-02-20T21:33:07Z)
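Several of the related papers above adapt audio-only SELD models to use video. As a generic illustration (not the architecture of any cited paper), the PyTorch sketch below fuses a frame-wise audio embedding with a frame-wise visual feature vector, e.g. detected object positions, before predicting per-class activity and DOA; all layer sizes and the 13-class count are assumptions.

# Minimal sketch of late audio-visual fusion for SELD: concatenate frame-wise
# audio embeddings with visual features (e.g. object positions) and predict
# per-class activity and DOA. A generic illustration, not the architecture of
# any cited paper; layer sizes and the class count are assumptions.
import torch
import torch.nn as nn

class AVFusionHead(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=64, n_classes=13):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 128),
            nn.ReLU(),
        )
        self.activity = nn.Linear(128, n_classes)    # per-class activity logits
        self.doa = nn.Linear(128, n_classes * 3)     # per-class (x, y, z) DOA

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (batch, frames, audio_dim); visual_feat: (batch, frames, visual_dim)
        x = self.fuse(torch.cat([audio_feat, visual_feat], dim=-1))
        return self.activity(x), self.doa(x)

Simple concatenation is only one fusion choice; attention-based or multi-stage fusion is also common, but the concatenation head keeps the audio-visual correspondence idea visible in a few lines.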