Benchmarking Representations for Speech, Music, and Acoustic Events
- URL: http://arxiv.org/abs/2405.00934v1
- Date: Thu, 2 May 2024 01:24:53 GMT
- Title: Benchmarking Representations for Speech, Music, and Acoustic Events
- Authors: Moreno La Quatra, Alkis Koudounas, Lorenzo Vaiani, Elena Baralis, Luca Cagliero, Paolo Garza, Sabato Marco Siniscalchi
- Abstract summary: ARCH is a comprehensive benchmark for evaluating audio representation learning (ARL) methods on diverse audio classification domains.
ARCH comprises 12 datasets that allow us to thoroughly assess pre-trained self-supervised learning (SSL) models of different sizes.
To address the current lack of open-source, pre-trained models for non-speech audio, we also release new pre-trained models that demonstrate strong performance on non-speech datasets.
- Score: 24.92641211471113
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Limited diversity in standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a comprehensive benchmark for evaluating ARL methods on diverse audio classification domains, covering acoustic events, music, and speech. ARCH comprises 12 datasets that allow us to thoroughly assess pre-trained SSL models of different sizes. ARCH streamlines benchmarking of ARL techniques through its unified access to a wide range of domains and its ability to readily incorporate new datasets and models. To address the current lack of open-source, pre-trained models for non-speech audio, we also release new pre-trained models that demonstrate strong performance on non-speech datasets. We argue that the presented wide-ranging evaluation provides valuable insights into state-of-the-art ARL methods, and is useful to pinpoint promising research directions.
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- A Suite for Acoustic Language Model Evaluation [20.802090523583196]
We introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity and room impulse response.
We evaluate several speech language models on SALMon, thus highlighting the strengths and weaknesses of each evaluated method.
arXiv Detail & Related papers (2024-09-11T17:34:52Z)
- Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs [3.8300818830608345]
Multi-modal contrastive learning strategies for audio and text have rapidly gained interest.
The ability of these models to understand natural language and temporal relations is still a largely unexplored and open field for research.
We propose to equip multi-modal ALMs with temporal understanding, without losing their inherent prior capabilities on audio-language tasks, using a temporal instillation method, TeminAL.
arXiv Detail & Related papers (2024-08-17T18:53:17Z)
- Houston we have a Divergence: A Subgroup Performance Analysis of ASR Models [2.5168553347063862]
The Fearless Steps APOLLO Community Resource provides unparalleled opportunities to explore the potential of multi-speaker team communications.
This study focuses on discovering the characteristics that make Apollo recordings more or less intelligible to Automatic Speech Recognition (ASR) methods.
arXiv Detail & Related papers (2024-03-31T10:06:19Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- MetaAudio: A Few-Shot Audio Classification Benchmark [2.294014185517203]
This work aims to alleviate the reliance on image-based benchmarks by offering the first comprehensive, public and fully reproducible audio-based alternative.
We compare the few-shot classification performance of a variety of techniques on seven audio datasets.
Our experimentation shows gradient-based meta-learning methods such as MAML and Meta-Curvature consistently outperform both metric and baseline methods.
arXiv Detail & Related papers (2022-04-05T11:33:44Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
arXiv Detail & Related papers (2021-10-08T05:07:35Z)
- Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models.
Not much has been studied about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.