Epic-Sounds: A Large-scale Dataset of Actions That Sound
- URL: http://arxiv.org/abs/2302.00646v1
- Date: Wed, 1 Feb 2023 18:19:37 GMT
- Title: Epic-Sounds: A Large-scale Dataset of Actions That Sound
- Authors: Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew
Zisserman
- Abstract summary: EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
We train and evaluate two state-of-the-art audio recognition models on our dataset, highlighting the importance of audio-only labels.
- Score: 90.1102766891699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations
capturing temporal extents and class labels within the audio stream of
egocentric videos. We propose an annotation pipeline where annotators
temporally label distinguishable audio segments and describe the action that
could have caused the sound. By grouping these free-form descriptions of audio
into classes, we identify actions that can be discriminated purely from audio.
For actions that involve objects colliding, we collect human annotations of
the materials of these objects (e.g. a glass object being placed on a wooden
surface), which we verify against visual labels, discarding ambiguous cases.
Overall, EPIC-SOUNDS includes 78.4k categorised segments of audible events and
actions, distributed across 44 classes, as well as 39.2k non-categorised
segments. We train and evaluate two state-of-the-art audio recognition models
on our dataset, highlighting the importance of audio-only labels and the
limitations of current models in recognising actions that sound.
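Since the categorised segments reduce to temporal extents plus class labels, a quick way to get a feel for the dataset is to load the annotations and inspect the class distribution. Below is a minimal sketch in Python assuming a CSV release with start_timestamp, stop_timestamp, and class columns; the file name and schema are illustrative assumptions, not the official ones, so consult the actual annotation release.

```python
import pandas as pd

# Load the categorised segments. NOTE: the file name and column names are
# assumptions for illustration; check the official EPIC-SOUNDS annotation
# release for the real schema.
segments = pd.read_csv("EPIC_Sounds_train.csv")

# Each categorised segment has a temporal extent within a video and one of
# the 44 audio classes.
segments["duration_s"] = (
    pd.to_timedelta(segments["stop_timestamp"])
    - pd.to_timedelta(segments["start_timestamp"])
).dt.total_seconds()

# Class distribution: rare classes are the hard cases for audio-only models.
print(segments["class"].value_counts())
print(f"{len(segments)} segments across {segments['class'].nunique()} classes; "
      f"mean duration {segments['duration_s'].mean():.2f}s")
```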
Related papers
- BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation
Knowledge [43.92428145744478]
We propose a two-stage bootstrapping audio-visual segmentation framework.
In the first stage, we employ a segmentation model to localize potential sounding objects from visual data.
In the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects.
arXiv Detail & Related papers (2023-08-20T06:48:08Z)
- Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics [26.473529162341837]
We present an audio-visual instance-aware segmentation approach to overcome the dataset bias.
Our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio.
Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
arXiv Detail & Related papers (2023-07-31T12:56:30Z)
- STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with
Spatiotemporal Annotations of Sound Events [30.459545240265246]
Sound events usually derive from visually perceptible source objects, e.g., the sounds of footsteps come from the feet of a walker.
This paper proposes an audio-visual sound event localization and detection (SELD) task.
Audio-visual SELD systems can detect and localize sound events using signals from a microphone array together with audio-visual correspondence.
arXiv Detail & Related papers (2023-06-15T13:37:14Z)
- A dataset for Audio-Visual Sound Event Detection in Movies [33.59510253345295]
We present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S).
We use publicly-available closed-caption transcripts to automatically mine over 110K audio events from 430 movies.
We identify three dimensions along which to categorize audio events: sound, source, and quality, and present the steps involved in producing a final taxonomy of 245 sounds.
arXiv Detail & Related papers (2023-02-14T19:55:39Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z)
- ARCA23K: An audio dataset for investigating open-set label noise [48.683197172795865]
This paper introduces ARCA23K, an Automatically Retrieved and Curated Audio dataset comprising over 23,000 labelled Freesound clips.
We show that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and we refer to this type of label noise as open-set label noise.
arXiv Detail & Related papers (2021-09-19T21:10:25Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave out one feature at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Discriminative Sounding Objects Localization via Self-supervised Audiovisual
Matching [87.42246194790467]
We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
arXiv Detail & Related papers (2020-10-12T05:51:55Z)