Epic-Sounds: A Large-scale Dataset of Actions That Sound
- URL: http://arxiv.org/abs/2302.00646v1
- Date: Wed, 1 Feb 2023 18:19:37 GMT
- Title: Epic-Sounds: A Large-scale Dataset of Actions That Sound
- Authors: Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew
Zisserman
- Abstract summary: EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
We train and evaluate two state-of-the-art audio recognition models on our dataset, highlighting the importance of audio-only labels.
- Score: 90.1102766891699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations
capturing temporal extents and class labels within the audio stream of
egocentric videos. We propose an annotation pipeline where annotators
temporally label distinguishable audio segments and describe the action that
could have caused the sound. By grouping these free-form descriptions of audio
into classes, we identify actions that can be discriminated purely from audio.
For actions that involve objects colliding, we collect human annotations of
the materials of these objects (e.g. a glass object being placed on a wooden
surface), which we verify against visual labels, discarding ambiguous cases.
Overall, EPIC-SOUNDS includes 78.4k categorised segments of audible events and
actions, distributed across 44 classes, as well as 39.2k non-categorised
segments. We train and evaluate two state-of-the-art audio recognition models
on our dataset, highlighting the importance of audio-only labels and the
limitations of current models in recognising actions that sound.
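Since the categorised segments reduce to temporal extents plus class labels, a quick way to get a feel for the dataset is to load the annotations and inspect the class distribution. Below is a minimal sketch in Python assuming a CSV release with start_timestamp, stop_timestamp, and class columns; the file name and schema are illustrative assumptions, not the official ones, so consult the actual annotation release.

```python
import pandas as pd

# Load the categorised segments. NOTE: the file name and column names are
# assumptions for illustration; check the official EPIC-SOUNDS annotation
# release for the real schema.
segments = pd.read_csv("EPIC_Sounds_train.csv")

# Each categorised segment has a temporal extent within a video and one of
# the 44 audio classes.
segments["duration_s"] = (
    pd.to_timedelta(segments["stop_timestamp"])
    - pd.to_timedelta(segments["start_timestamp"])
).dt.total_seconds()

# Class distribution: rare classes are the hard cases for audio-only models.
print(segments["class"].value_counts())
print(f"{len(segments)} segments across {segments['class'].nunique()} classes; "
      f"mean duration {segments['duration_s'].mean():.2f}s")
```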
Related papers
- BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation
Knowledge [43.92428145744478]
We propose a two-stage bootstrapping audio-visual segmentation framework.
In the first stage, we employ a segmentation model to localize potential sounding objects from visual data.
In the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects.
arXiv Detail & Related papers (2023-08-20T06:48:08Z)
- Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics [26.473529162341837]
We present an audio-visual instance-aware segmentation approach to overcome the dataset bias.
Our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio.
Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
arXiv Detail & Related papers (2023-07-31T12:56:30Z)
- STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with
Spatiotemporal Annotations of Sound Events [30.459545240265246]
Sound events usually derive from visually perceptible source objects, e.g., the sounds of footsteps come from the feet of a walker.
This paper proposes an audio-visual sound event localization and detection (SELD) task.
Audio-visual SELD systems can detect and localize sound events using signals from a microphone array together with audio-visual correspondence.
arXiv Detail & Related papers (2023-06-15T13:37:14Z)
- A dataset for Audio-Visual Sound Event Detection in Movies [33.59510253345295]
We present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S).
We use publicly-available closed-caption transcripts to automatically mine over 110K audio events from 430 movies.
We identify three dimensions along which to categorize audio events: sound, source, and quality, and present the steps involved in producing a final taxonomy of 245 sounds.
arXiv Detail & Related papers (2023-02-14T19:55:39Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z)
- ARCA23K: An audio dataset for investigating open-set label noise [48.683197172795865]
This paper introduces ARCA23K, an Automatically Retrieved and Curated Audio dataset comprising over 23,000 labelled Freesound clips.
We show that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and we refer to this type of label noise as open-set label noise.
arXiv Detail & Related papers (2021-09-19T21:10:25Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave out one feature at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Discriminative Sounding Objects Localization via Self-supervised Audiovisual
Matching [87.42246194790467]
We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
arXiv Detail & Related papers (2020-10-12T05:51:55Z)