Audio Contrastive based Fine-tuning
- URL: http://arxiv.org/abs/2309.11895v3
- Date: Thu, 19 Oct 2023 21:32:01 GMT
- Title: Audio Contrastive based Fine-tuning
- Authors: Yang Wang, Qibin Liang, Chenghao Xiao, Yizhi Li, Noura Al Moubayed,
Chenghua Lin
- Abstract summary: We introduce Audio Contrastive-based Fine-tuning (AudioConFit) as an efficient approach characterised by robust generalisability.
Empirical experiments on a variety of audio classification tasks demonstrate the effectiveness and robustness of our approach.
- Score: 21.145936249583446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio classification plays a crucial role in speech and sound processing
tasks with a wide range of applications. A key challenge remains: striking the
right balance between fitting the model to the training data (without
overfitting) and enabling it to generalise well to a new domain.
Leveraging the transferability of contrastive learning, we introduce Audio
Contrastive-based Fine-tuning (AudioConFit), an efficient approach
characterised by robust generalisability. Empirical experiments on a variety of
audio classification tasks demonstrate the effectiveness and robustness of our
approach, which achieves state-of-the-art results in various settings.
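The abstract does not spell out the training objective, but contrastive fine-tuning of this flavour is often implemented with a supervised contrastive loss over a batch of audio embeddings. The following is a minimal sketch under that assumption; every name, shape, and the temperature value are illustrative, not details taken from the paper.

```python
# Minimal sketch of contrastive-based fine-tuning for audio classification.
# Assumes a pretrained encoder that maps clips to embeddings; none of these
# names or values come from the AudioConFit paper.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Pull same-class embeddings together, push different classes apart."""
    z = F.normalize(embeddings, dim=1)              # (B, D) unit vectors
    sim = z @ z.t() / temperature                   # (B, B) scaled cosine sims
    # Mask out self-similarity on the diagonal.
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))
    # Positives: pairs of distinct samples that share a class label.
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # Average log-probability over each anchor's positives.
    pos_counts = pos.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos, 0).sum(dim=1) / pos_counts)
    return loss[pos.any(dim=1)].mean()              # skip anchors without positives

# Usage with a fake batch; in practice `emb` would be the encoder's pooled output.
emb = torch.randn(8, 128)
lab = torch.randint(0, 3, (8,))
print(supervised_contrastive_loss(emb, lab))
```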
Related papers
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.
These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
- Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing [107.031903351176]
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances, and to identify their event categories using only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z)
- Multi-source Domain Adaptation for Text-independent Forensic Speaker Recognition [36.83842373791537]
Adapting speaker recognition systems to new environments is a widely used technique for improving a well-performing model.
Previous studies focus on single domain adaptation, which neglects a more practical scenario where training data are collected from multiple acoustic domains.
Three novel adaptation methods are proposed to further promote adaptation performance across multiple acoustic domains.
arXiv Detail & Related papers (2022-11-17T22:11:25Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in intervened noisy speech using a noise detector, and assigns the two sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE (a toy sketch of this routing appears below).
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
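Based on the summary above, the core mechanism is frame-level routing: a noise detector decides which frames are noisy, and two mask-based enhancement modules handle the two sets. Below is a toy sketch of that routing, using soft gating for differentiability; the modules, shapes, and names are placeholders rather than the CISE architecture.

```python
# Toy sketch of detector-gated speech enhancement: a frame-level noise
# detector gates two mask-based enhancement modules. All modules here are
# placeholders, not the CISE architecture (which assigns frames explicitly).
import torch
import torch.nn as nn

class MaskEnhancer(nn.Module):
    """Predicts a per-bin mask in (0, 1) and applies it to the spectrogram."""
    def __init__(self, n_bins: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_bins, n_bins), nn.Sigmoid())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return frames * self.net(frames)

class ToyRoutedSE(nn.Module):
    def __init__(self, n_bins: int = 257):
        super().__init__()
        self.detector = nn.Sequential(nn.Linear(n_bins, 1), nn.Sigmoid())
        self.clean_em = MaskEnhancer(n_bins)   # handles frames judged clean
        self.noisy_em = MaskEnhancer(n_bins)   # handles frames judged noisy

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (frames, bins). p_noise gates which module dominates per frame.
        p_noise = self.detector(spec)                       # (frames, 1)
        return p_noise * self.noisy_em(spec) + (1 - p_noise) * self.clean_em(spec)

enhanced = ToyRoutedSE()(torch.rand(100, 257))
print(enhanced.shape)   # torch.Size([100, 257])
```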
- Noise-Tolerant Learning for Audio-Visual Action Recognition [31.641972732424463]
Video datasets are usually coarsely annotated or collected from the Internet.
We propose a noise-tolerant learning framework to find anti-interference model parameters against both noisy labels and noisy correspondence.
Our method significantly improves the robustness of the action recognition model and surpasses the baselines by a clear margin.
arXiv Detail & Related papers (2022-05-16T12:14:03Z)
- Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised method for learning audio and video representations.
It addresses audio-visual instance discrimination and improves transfer-learning performance (a generic sketch of this objective appears below).
arXiv Detail & Related papers (2021-03-29T19:52:29Z)
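Audio-visual instance discrimination is typically trained with a cross-modal InfoNCE objective, in which each audio embedding must pick out its own clip's video embedding from the batch. The following is a textbook sketch of that objective under paired-embedding assumptions, not the paper's exact formulation.

```python
# Generic cross-modal InfoNCE for audio-visual instance discrimination:
# each audio embedding should match its own clip's video embedding over
# all other clips in the batch. A textbook sketch, not the cited method.
import torch
import torch.nn.functional as F

def cross_modal_nce(audio_z: torch.Tensor, video_z: torch.Tensor,
                    temperature: float = 0.1) -> torch.Tensor:
    a = F.normalize(audio_z, dim=1)
    v = F.normalize(video_z, dim=1)
    logits = a @ v.t() / temperature        # (B, B); diagonal = matched pairs
    targets = torch.arange(len(a))
    # Symmetric loss: audio->video retrieval and video->audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(cross_modal_nce(torch.randn(16, 256), torch.randn(16, 256)))
```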
- Spectrum-Guided Adversarial Disparity Learning [52.293230153385124]
We propose a novel end-to-end knowledge-directed adversarial learning framework.
It portrays the class-conditioned intra-class disparity using two competitive encoding distributions and learns purified latent codes by denoising the learned disparity.
Experiments on four HAR benchmark datasets demonstrate the robustness and generalization of our proposed methods over a set of state-of-the-art baselines.
arXiv Detail & Related papers (2020-07-14T05:46:27Z)
- A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition [11.559570255513217]
We propose a sequential self-teaching approach to learning sounds.
Learning is harder in adverse situations, such as from weakly and/or noisily labeled data.
Our proposal is a sequential, stage-wise learning process that improves the generalization capabilities of a given modeling system (a staged sketch appears below).
arXiv Detail & Related papers (2020-06-30T22:53:43Z)
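One plausible reading of the stage-wise process (an assumption on my part, not a detail from the abstract) is a staged teacher-student loop: each stage trains a fresh model on the noisy labels mixed with the previous stage's soft predictions.

```python
# Sketch of stage-wise self-teaching: each stage trains a fresh model on a
# mix of the (possibly noisy) labels and the previous stage's soft outputs.
# The staging, mixing weight, and model are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_stage(model, x, noisy_labels, teacher=None, alpha=0.5, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        logits = model(x)
        loss = F.cross_entropy(logits, noisy_labels)
        if teacher is not None:                       # later stages also match
            with torch.no_grad():                     # the previous stage's
                soft = F.softmax(teacher(x), dim=1)   # softened beliefs
            loss = alpha * loss + (1 - alpha) * F.kl_div(
                F.log_softmax(logits, dim=1), soft, reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    return model

x = torch.randn(256, 64)
y = torch.randint(0, 10, (256,))                      # pretend-noisy labels
teacher = None
for stage in range(3):                                # sequential stages
    teacher = train_stage(nn.Linear(64, 10), x, y, teacher)
```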
- COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations that aligns the learned latent representations of audio and associated tags (a simplified alignment sketch appears below).
We evaluate the quality of our embedding model by measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
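Co-aligned autoencoders can be sketched as two autoencoders, one per modality, trained with their reconstruction losses plus a term that pulls paired latent codes together. The architectures, the MSE alignment term, and all names below are simplifications for illustration; COALA's actual alignment objective may differ.

```python
# Sketch of co-aligned autoencoders: an audio autoencoder and a tag
# autoencoder trained with reconstruction losses plus a latent alignment
# term on paired examples. Not COALA's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAE(nn.Module):
    def __init__(self, dim_in: int, dim_z: int = 32):
        super().__init__()
        self.enc = nn.Linear(dim_in, dim_z)
        self.dec = nn.Linear(dim_z, dim_in)

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

audio_ae, tag_ae = TinyAE(513), TinyAE(100)
spec, tags = torch.rand(8, 513), torch.rand(8, 100)   # paired batch

z_a, rec_a = audio_ae(spec)
z_t, rec_t = tag_ae(tags)
loss = (F.mse_loss(rec_a, spec) + F.mse_loss(rec_t, tags)
        + F.mse_loss(z_a, z_t))          # align paired latent codes
loss.backward()
```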
- Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features based on the correlation of feature pairs (sketched below).
We show superior performance in terms of compact feature dimensionality and improved computational speed at test time.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
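A correlation-of-feature-pairs representation can be computed by stacking per-frame descriptors and keeping the upper triangle of their correlation matrix, which is compact: n(n-1)/2 values for n features. The descriptor count and framing below are assumptions for illustration, not the paper's feature set.

```python
# Sketch: turn a (frames x features) matrix of hand-crafted descriptors into
# a compact vector of pairwise feature correlations (upper triangle of the
# correlation matrix). The descriptors themselves are placeholders.
import numpy as np

def correlation_representation(feats: np.ndarray) -> np.ndarray:
    """feats: (n_frames, n_features) -> vector of pairwise correlations."""
    corr = np.corrcoef(feats, rowvar=False)        # (n_features, n_features)
    iu = np.triu_indices_from(corr, k=1)           # unique pairs only
    return corr[iu]

frames = np.random.rand(400, 12)                   # e.g. 12 descriptors/frame
rep = correlation_representation(frames)
print(rep.shape)                                   # (66,) = 12*11/2 pairs
```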
This list is automatically generated from the titles and abstracts of the papers on this site.