The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection
- URL: http://arxiv.org/abs/2409.11262v2
- Date: Fri, 4 Oct 2024 10:35:03 GMT
- Title: The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection
- Authors: Gabriel Bibbó, Thomas Deacon, Arshdeep Singh, Mark D. Plumbley
- Abstract summary: This paper presents a residential audio dataset to support sound event detection research for smart home applications aimed at promoting wellbeing for older adults.
The dataset is constructed by deploying audio recording systems in the homes of 8 participants aged 55-80 years for a 7-day period.
A novel automated speech removal pipeline is developed, using pre-trained audio neural networks to detect and remove segments containing spoken voice.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a residential audio dataset to support sound event detection research for smart home applications aimed at promoting wellbeing for older adults. The dataset is constructed by deploying audio recording systems in the homes of 8 participants aged 55-80 years for a 7-day period. Acoustic characteristics are documented through detailed floor plans and construction material information to enable replication of the recording environments for AI model deployment. A novel automated speech removal pipeline is developed, using pre-trained audio neural networks to detect and remove segments containing spoken voice, while preserving segments containing other sound events. The resulting dataset consists of privacy-compliant audio recordings that accurately capture the soundscapes and activities of daily living within residential spaces. The paper details the dataset creation methodology, the speech removal pipeline utilizing cascaded model architectures, and an analysis of the vocal label distribution to validate the speech removal process. This dataset enables the development and benchmarking of sound event detection models tailored specifically for in-home applications.
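The speech removal step described in the abstract (detect segments containing spoken voice with pre-trained audio neural networks, discard them, keep all other sound events) can be pictured with a short sketch. Everything below is illustrative: the segment length, the threshold, and the speech_probability stub are assumptions standing in for the paper's cascaded pre-trained models, not the authors' implementation.
```python
# Minimal sketch of segment-wise speech removal, assuming a pre-trained
# audio tagger that returns P(speech) per segment. The stub, threshold,
# and window length are illustrative assumptions, not the paper's code.
import numpy as np

SR = 16000               # sample rate in Hz (assumed)
SEG_SECONDS = 1.0        # analysis window length (assumed)
SPEECH_THRESHOLD = 0.1   # low threshold favours privacy over audio retention

def speech_probability(segment: np.ndarray) -> float:
    """Stand-in for a pre-trained audio tagger (e.g. a PANNs-style network).
    Stubbed with an energy heuristic so the sketch runs end to end."""
    return float(np.clip(np.abs(segment).mean() * 10.0, 0.0, 1.0))

def remove_speech(audio: np.ndarray) -> np.ndarray:
    """Drop every segment the tagger flags as speech; keep the rest,
    so non-speech sound events survive unchanged."""
    seg_len = int(SR * SEG_SECONDS)
    kept = [audio[i:i + seg_len]
            for i in range(0, len(audio), seg_len)
            if speech_probability(audio[i:i + seg_len]) < SPEECH_THRESHOLD]
    return np.concatenate(kept) if kept else np.empty(0, dtype=audio.dtype)

if __name__ == "__main__":
    ten_seconds = (np.random.randn(SR * 10) * 0.01).astype(np.float32)
    cleaned = remove_speech(ten_seconds)
    print(f"kept {len(cleaned) / SR:.1f}s of 10.0s")
```
A cascaded version of this idea would run a cheap first-stage detector over all audio and pass borderline segments to a heavier second-stage model; re-tagging the cleaned audio and inspecting the residual vocal label distribution mirrors the validation analysis the paper reports.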
Related papers
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- AGS: An Dataset and Taxonomy for Domestic Scene Sound Event Recognition [1.5106201893222209]
This paper proposes a dataset (called AGS) for domestic environment sounds.
The dataset covers various types of overlapping audio in the scene as well as background noise.
arXiv Detail & Related papers (2023-08-30T03:03:47Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events [30.459545240265246]
Sound events usually derive from visually perceivable source objects, e.g., the sound of footsteps comes from the feet of a walker.
This paper proposes an audio-visual sound event localization and detection (SELD) task.
Audio-visual SELD systems can detect and localize sound events using signals from a microphone array together with audio-visual correspondence.
arXiv Detail & Related papers (2023-06-15T13:37:14Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied field of automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
- Proposal-based Few-shot Sound Event Detection for Speech and Environmental Sounds with Perceivers [0.7776497736451751]
We propose a region proposal-based approach to few-shot sound event detection utilizing the Perceiver architecture.
Motivated by a lack of suitable benchmark datasets, we generate two new few-shot sound event localization datasets.
arXiv Detail & Related papers (2021-07-28T19:46:55Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Unsupervised Pattern Discovery from Thematic Speech Archives Based on Multilingual Bottleneck Features [41.951988293049205]
We propose a two-stage approach, which comprises unsupervised acoustic modeling and decoding, followed by pattern mining in acoustic unit sequences.
The proposed system is able to effectively extract topic-related words and phrases from the lecture recordings on MIT OpenCourseWare.
arXiv Detail & Related papers (2020-11-03T20:06:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.