A Study on Robustness to Perturbations for Representations of
Environmental Sound
- URL: http://arxiv.org/abs/2203.10425v2
- Date: Wed, 23 Mar 2022 01:22:01 GMT
- Title: A Study on Robustness to Perturbations for Representations of
Environmental Sound
- Authors: Sangeeta Srivastava, Ho-Hsiang Wu, Joao Rulff, Magdalena Fuentes, Mark
Cartwright, Claudio Silva, Anish Arora, Juan Pablo Bello
- Abstract summary: We evaluate two embeddings -- YAMNet and OpenL$^3$ -- on monophonic (UrbanSound8K) and polyphonic (SONYC UST) datasets.
We imitate channel effects by injecting perturbations to the audio signal and measure the shift in the new embeddings with three distance measures.
- Score: 16.361059909912758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio applications involving environmental sound analysis increasingly use
general-purpose audio representations, also known as embeddings, for transfer
learning. Recently, Holistic Evaluation of Audio Representations (HEAR)
evaluated twenty-nine embedding models on nineteen diverse tasks. However, the
evaluation's effectiveness depends on the variation already captured within a
given dataset. Therefore, for a given data domain, it is unclear how the
representations would be affected by the variations caused by myriad
microphones' range and acoustic conditions -- commonly known as channel
effects. We aim to extend HEAR to evaluate invariance to channel effects in
this work. To accomplish this, we imitate channel effects by injecting
perturbations to the audio signal and measure the shift in the new (perturbed)
embeddings with three distance measures, making the evaluation domain-dependent
but not task-dependent. Combined with the downstream performance, it helps us
make a more informed prediction of how robust the embeddings are to the channel
effects. We evaluate two embeddings -- YAMNet and OpenL$^3$ -- on monophonic
(UrbanSound8K) and polyphonic (SONYC UST) datasets. We show that one distance
measure does not suffice in such task-independent evaluation. Although
Fr\'echet Audio Distance (FAD) correlates with the trend of the performance
drop in the downstream task most accurately, we show that we need to study this
in conjunction with the other distances to get a clear understanding of the
overall effect of the perturbation. In terms of the embedding performance, we
find OpenL$^3$ to be more robust than YAMNet, which aligns with the HEAR
evaluation.
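The abstract's pipeline can be illustrated in miniature: perturb an audio signal to imitate a channel effect, embed both versions, and measure the distributional shift. The sketch below is a hypothetical illustration in plain NumPy/SciPy, not the authors' code; the helper names are invented, white noise at a target SNR stands in for the paper's full set of channel perturbations, and the Fréchet distance between Gaussians fitted to the two embedding sets stands in for FAD (the study itself uses YAMNet/OpenL$^3$ embeddings and three distance measures).

```python
import numpy as np
from scipy.linalg import sqrtm

def add_noise_at_snr(signal, snr_db, rng=None):
    """Inject white Gaussian noise at a target SNR (dB) to imitate a channel effect."""
    if rng is None:
        rng = np.random.default_rng(0)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def frechet_distance(emb_clean, emb_perturbed):
    """Frechet distance between Gaussians fitted to two embedding sets:
    d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    mu1, mu2 = emb_clean.mean(axis=0), emb_perturbed.mean(axis=0)
    s1 = np.cov(emb_clean, rowvar=False)
    s2 = np.cov(emb_perturbed, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(42)
emb = rng.normal(size=(500, 8))   # "clean" embeddings
shifted = emb + 1.0               # a perturbation-induced shift
print(frechet_distance(emb, emb))      # near zero: identical distributions
print(frechet_distance(emb, shifted))  # clearly nonzero: distribution moved
```

In the paper's setting, a small distance after perturbation (together with stable downstream performance) indicates an embedding that is invariant to that channel effect; this sketch only shows the distance computation, which is domain-dependent but requires no task labels.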
Related papers
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine utterances and fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z)
- Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval [7.459223771397159]
Cross-modal data (e.g. audiovisual) have different distributions and representations that cannot be directly compared.
To bridge the gap between audiovisual modalities, we learn a common subspace for them by utilizing the intrinsic correlation in the natural synchronization of audio-visual data with the aid of annotated labels.
We propose a new AV-CMR model to optimize semantic features by directly predicting labels and then measuring the intrinsic correlation between audio-visual data using complete cross-triple loss.
arXiv Detail & Related papers (2022-11-07T10:37:14Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- Blind Room Parameter Estimation Using Multiple-Multichannel Speech Recordings [37.145413836886455]
Knowing the geometrical and acoustical parameters of a room may benefit applications such as audio augmented reality, speech dereverberation or audio forensics.
We study the problem of jointly estimating the total surface area, the volume, as well as the frequency-dependent reverberation time and mean surface absorption of a room.
A novel convolutional neural network architecture leveraging both single- and inter-channel cues is proposed and trained on a large, realistic simulated dataset.
arXiv Detail & Related papers (2021-07-29T08:51:49Z)
- Positive Sample Propagation along the Audio-Visual Event Line [29.25572713908162]
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs).
We propose a new positive sample propagation (PSP) module to discover and exploit closely related audio-visual pairs.
We perform extensive experiments on the public AVE dataset and achieve new state-of-the-art accuracy in both fully and weakly supervised settings.
arXiv Detail & Related papers (2021-04-01T03:53:57Z)
- Investigations on Audiovisual Emotion Recognition in Noisy Conditions [43.40644186593322]
We present an investigation on two emotion datasets with superimposed noise at different signal-to-noise ratios.
The results show a significant performance decrease when a model trained on clean audio is applied to noisy data.
arXiv Detail & Related papers (2021-03-02T17:45:16Z)
- Exploration of Audio Quality Assessment and Anomaly Localisation Using Attention Models [37.60722440434528]
In this paper, a novel model for audio quality assessment is proposed by jointly using bidirectional long short-term memory and an attention mechanism.
The former is to mimic a human auditory perception ability to learn information from a recording, and the latter is to further discriminate interferences from desired signals by highlighting target related features.
To evaluate our proposed approach, the TIMIT dataset is used and augmented by mixing with various natural sounds.
arXiv Detail & Related papers (2020-05-16T17:54:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.