A dataset for Audio-Visual Sound Event Detection in Movies
- URL: http://arxiv.org/abs/2302.07315v1
- Date: Tue, 14 Feb 2023 19:55:39 GMT
- Title: A dataset for Audio-Visual Sound Event Detection in Movies
- Authors: Rajat Hebbar, Digbalay Bose, Krishna Somandepalli, Veena Vijai,
Shrikanth Narayanan
- Abstract summary: We present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S)
We use publicly-available closed-caption transcripts to automatically mine over 110K audio events from 430 movies.
We identify three dimensions for categorizing audio events, namely sound, source, and quality, and present the steps involved in producing a final taxonomy of 245 sounds.
- Score: 33.59510253345295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio event detection is a widely studied audio processing task, with
applications ranging from self-driving cars to healthcare. In-the-wild datasets
such as AudioSet have propelled research in this field. However, many efforts
typically involve manual annotation and verification, which is expensive to
perform at scale. Movies depict various real-life and fictional scenarios, which
makes them a rich resource for mining a wide range of audio events. In this
work, we present a dataset of audio events called Subtitle-Aligned Movie Sounds
(SAM-S). We use publicly-available closed-caption transcripts to automatically
mine over 110K audio events from 430 movies. We identify three dimensions for
categorizing audio events, namely sound, source, and quality, and present the
steps involved in producing a final taxonomy of 245 sounds. We discuss the choices involved in
generating the taxonomy, and also highlight the human-centered nature of sounds
in our dataset. We establish a baseline performance for audio-only sound
classification of 34.76% mean average precision and show that incorporating
visual information can further improve the performance by about 5%. Data and
code are made available for research at
https://github.com/usc-sail/mica-subtitle-aligned-movie-sounds
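To make the caption-mining idea concrete, below is a minimal, hypothetical sketch (not the authors' released pipeline; the file name, the bracket convention, and the regular expressions are assumptions) of how non-speech captions such as "[DOOR SLAMS]" could be pulled from an SRT subtitle file together with their timestamps.
```python
import re
from pathlib import Path

# Hypothetical sketch: extract bracketed non-speech captions (e.g. "[DOOR SLAMS]",
# "(thunder rumbling)") from an SRT subtitle file, along with start/end timestamps.
# This is NOT the SAM-S pipeline, only an illustration of the general idea.

TIME_RE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")
EVENT_RE = re.compile(r"[\[\(]([^\]\)]+)[\]\)]")  # text inside [...] or (...)

def mine_sound_events(srt_path):
    """Return (start, end, description) tuples for bracketed captions in one file."""
    events = []
    start = end = None
    text = Path(srt_path).read_text(encoding="utf-8", errors="ignore")
    for line in text.splitlines():
        m = TIME_RE.search(line)
        if m:
            # Remember the timestamps of the current subtitle block.
            start, end = m.group(1), m.group(2)
            continue
        if start is None:
            continue
        for desc in EVENT_RE.findall(line):
            events.append((start, end, desc.strip().lower()))
    return events

if __name__ == "__main__":
    # "example_movie.srt" is a placeholder path for illustration only.
    for start, end, desc in mine_sound_events("example_movie.srt"):
        print(f"{start} --> {end}\t{desc}")
```
Descriptions mined in this way would then feed into the sound/source/quality categorization and the final 245-sound taxonomy described in the abstract.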
Related papers
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - WavJourney: Compositional Audio Creation with Large Language Models [38.39551216587242]
We present WavJourney, a novel framework that leverages Large Language Models to connect various audio models for audio creation.
WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions.
We show that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions.
arXiv Detail & Related papers (2023-07-26T17:54:04Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes
with Spatiotemporal Annotations of Sound Events [30.459545240265246]
Sound events usually derive from visible source objects; for example, footsteps come from the feet of a walker.
This paper proposes an audio-visual sound event localization and detection (SELD) task.
Audio-visual SELD systems can detect and localize sound events using microphone array signals together with audio-visual correspondence.
arXiv Detail & Related papers (2023-06-15T13:37:14Z) - Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing temporal extents and class labels.
We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes.
Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z) - Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - FSD50K: An Open Dataset of Human-Labeled Sound Events [30.42735806815691]
We introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology.
The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms).
arXiv Detail & Related papers (2020-10-01T15:07:25Z) - VGGSound: A Large-scale Audio-Visual Dataset [160.1604237188594]
We propose a scalable pipeline to create an audio dataset from open-source media.
We use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes.
The resulting dataset can be used for training and evaluating audio recognition models.
arXiv Detail & Related papers (2020-04-29T17:46:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.