Robust Feature Learning on Long-Duration Sounds for Acoustic Scene
Classification
- URL: http://arxiv.org/abs/2108.05008v1
- Date: Wed, 11 Aug 2021 03:33:05 GMT
- Title: Robust Feature Learning on Long-Duration Sounds for Acoustic Scene
Classification
- Authors: Yuzhong Wu, Tan Lee
- Abstract summary: Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal is recorded.
We propose a robust feature learning (RFL) framework to train the CNN.
- Score: 54.57150493905063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acoustic scene classification (ASC) aims to identify the type of scene
(environment) in which a given audio signal is recorded. The log-mel feature
and convolutional neural network (CNN) have recently become the most popular
time-frequency (TF) feature representation and classifier in ASC. An audio
signal recorded in a scene may include various sounds overlapping in time and
frequency. A previous study suggests that having the CNN consider long-duration
and short-duration sounds separately may improve ASC accuracy. This study
addresses the generalization ability of acoustic scene classifiers. In
practice, the characteristics of acoustic scene signals may be affected by
various factors, such as the choice of recording device and changes in
recording location. When an established ASC system predicts scene classes on
audio recorded in unseen scenarios, its accuracy may drop significantly.
Long-duration sounds not only contain domain-independent acoustic scene
information, but also carry channel information determined by the recording
conditions, which is prone to overfitting. To build a more robust ASC system,
we propose a robust feature learning (RFL) framework for training the CNN. The
RFL framework down-weights the CNN's learning specifically on long-duration
sounds. The proposed method trains an auxiliary classifier that takes only
long-duration sound information as input. The auxiliary classifier is trained
with an auxiliary loss function that assigns less learning weight to poorly
classified examples than the standard cross-entropy loss does. Experimental
results show that the proposed RFL framework yields an acoustic scene
classifier that is more robust to unseen devices and cities.
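As a rough illustration of such a loss, consider a focal-loss-style weighting with the exponent applied to the correct-class probability itself, so that poorly classified examples receive *less* weight instead of more. This is a minimal sketch, assuming a PyTorch framing; the exponent `gamma` and the exact functional form are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def auxiliary_loss(logits: torch.Tensor, targets: torch.Tensor,
                   gamma: float = 2.0) -> torch.Tensor:
    """Cross-entropy variant that down-weights poorly classified examples.

    Focal loss up-weights hard examples via (1 - p)^gamma; inverting the
    weight to p^gamma makes examples the auxiliary classifier gets wrong
    contribute less gradient. `gamma` is a hypothetical hyperparameter.
    """
    log_probs = F.log_softmax(logits, dim=-1)                          # (batch, classes)
    log_p_true = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p(correct class)
    weight = log_p_true.exp().detach() ** gamma                        # small when p is small
    return -(weight * log_p_true).mean()

# Usage: logits would come from an auxiliary head fed only long-duration
# information; here they are dummy values for illustration.
logits = torch.randn(8, 10)                  # 8 clips, 10 scene classes
targets = torch.randint(0, 10, (8,))
loss = auxiliary_loss(logits, targets)
```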
Related papers
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored setup, yet a much cheaper one.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up training by using a single multiscale spectrogram adversary.
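As a rough sketch of the "quantized latent space" idea only (the actual codec uses residual vector quantization with multiple codebooks), a minimal single-codebook quantizer with a straight-through gradient estimator could look like this; all sizes are invented for the example.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Snap each latent frame to its nearest codebook entry (single codebook)."""

    def __init__(self, codebook_size: int = 1024, dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, time, dim); squared distance to every codebook vector
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (batch, time, K)
        z_q = self.codebook(d.argmin(dim=-1))                        # nearest codes
        # Straight-through estimator: values come from z_q, gradients flow to z.
        return z + (z_q - z).detach()

# Usage: quantize the encoder output before decoding (dummy shapes).
z = torch.randn(2, 100, 128)
z_q = VectorQuantizer()(z)
```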
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - A Comparative Study on Approaches to Acoustic Scene Classification using
CNNs [0.0]
The choice of input representation has a dramatic effect on classification accuracy.
We investigated spectrogram, MFCC, and embedding representations using different CNNs and autoencoders.
We found that the spectrogram representation yields the highest classification accuracy, while MFCCs yield the lowest.
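For concreteness, the two time-frequency representations being compared can be computed with librosa roughly as follows; the file name and parameter values are placeholders, not the study's settings.

```python
import librosa

# "scene.wav" is a placeholder path; parameters are illustrative.
y, sr = librosa.load("scene.wav", sr=None)

# Log-scaled mel spectrogram: the representation found to classify best.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel)                  # shape: (n_mels, frames)

# MFCCs: a compact, decorrelated summary; found to classify worst here.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # shape: (n_mfcc, frames)
```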
arXiv Detail & Related papers (2022-04-26T09:23:29Z) - Deep Convolutional Neural Network for Roadway Incident Surveillance
Using Audio Data [0.0]
Identifying and predicting crash events plays a vital role in understanding safety conditions for transportation systems.
We propose the use of a novel sensory unit that can also accurately identify crash events: the microphone.
Four event types (crash, tire skid, horn, and siren sounds) can be accurately identified, giving an indication of a road hazard.
arXiv Detail & Related papers (2022-03-09T13:42:56Z) - Audiovisual transfer learning for audio tagging and sound event
detection [21.574781022415372]
We study the merit of transfer learning for two sound recognition problems, i.e., audio tagging and sound event detection.
We adapt a baseline system utilizing only spectral acoustic inputs to make use of pretrained auditory and visual features.
We perform experiments with these modified models on an audiovisual multi-label data set.
arXiv Detail & Related papers (2021-06-09T21:55:05Z) - DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
Evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
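As a schematic of the iterative denoising idea, one generic DDPM-style reverse step is sketched below; the noise schedule and the network's noise estimate are stubbed out as assumptions, and DiffSinger's actual sampler (with score conditioning and further mechanisms) differs.

```python
import torch

def reverse_step(x_t, t, eps_pred, alphas, alphas_cumprod):
    """One DDPM-style reverse step: remove a little predicted noise from x_t.

    eps_pred is the network's noise estimate at step t (in DiffSinger it would
    be conditioned on the music score); the schedule terms are generic DDPM.
    """
    a_t, ac_t = alphas[t], alphas_cumprod[t]
    mean = (x_t - (1 - a_t) / torch.sqrt(1 - ac_t) * eps_pred) / torch.sqrt(a_t)
    if t == 0:
        return mean
    return mean + torch.sqrt(1 - a_t) * torch.randn_like(x_t)  # sigma_t = sqrt(beta_t)

# Dummy schedule and shapes for illustration only.
betas = torch.linspace(1e-4, 0.02, 100)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
x = torch.randn(1, 80, 400)                  # noisy mel-spectrogram
for t in reversed(range(100)):
    eps = torch.zeros_like(x)                # stand-in for the network output
    x = reverse_step(x, t, eps, alphas, alphas_cumprod)
```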
arXiv Detail & Related papers (2021-05-06T05:21:42Z) - CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile
Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z) - Acoustic Scene Classification with Squeeze-Excitation Residual Networks [4.591851728010269]
We propose two novel squeeze-excitation blocks to improve the accuracy of a CNN-based ASC framework based on residual learning.
The behavior of the block that implements these operators, and therefore of the entire neural network, can be modified depending on the input to the block.
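For reference, the classic squeeze-and-excitation block that such variants build on recalibrates channels from a pooled global descriptor. This is a sketch of the standard formulation, not the two novel blocks proposed in the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Classic squeeze-and-excitation: reweight channels from global context."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        s = x.mean(dim=(2, 3))                     # squeeze: global average pool
        w = self.fc(s)                             # excitation: per-channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)   # scale the feature maps
```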
arXiv Detail & Related papers (2020-03-20T14:07:11Z) - CURE Dataset: Ladder Networks for Audio Event Classification [15.850545634216484]
Approximately 3 million people with hearing loss cannot perceive events happening around them.
This paper introduces the CURE dataset, which contains a curated set of specific audio events most relevant for people with hearing loss.
arXiv Detail & Related papers (2020-01-12T09:35:30Z)