VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio
- URL: http://arxiv.org/abs/2512.10120v1
- Date: Wed, 10 Dec 2025 22:13:12 GMT
- Title: VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio
- Authors: Maris Basha, Anja Zai, Sabine Stoll, Richard Hahnloser
- Abstract summary: VocSim is a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings. VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds.
- Score: 1.0791267046450075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings. VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds. By restricting to single-source audio, we isolate content representation from the confound of source separation. We evaluate embeddings using Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation. To calibrate GSR, we report its lift over an empirical permutation baseline. Across diverse foundation models, a simple pipeline (frozen Whisper encoder features, time-frequency pooling, and label-free PCA) yields strong zero-shot performance. However, VocSim also uncovers a consistent generalization gap: on blind, low-resource speech, local retrieval drops sharply. While performance remains statistically distinguishable from chance, the absolute geometric structure collapses, indicating a failure to generalize to unseen phonotactics. As external validation, our top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art results on the HEAR benchmark. We posit that the intrinsic geometric quality measured here proxies utility in unlisted downstream applications. We release data, code, and a public leaderboard to standardize the evaluation of intrinsic audio geometry.
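The evaluation recipe in the abstract is simple enough to sketch end to end: frozen encoder features, time-frequency pooling, label-free PCA, then Precision@k calibrated by lift over a label-permutation baseline. The code below is a minimal illustration, not the released pipeline: the frozen Whisper encoder is stubbed with random (time, dim) feature maps, the exact GSR definition is deferred to the paper, and the helper names (pool_time_freq, precision_at_k, permutation_lift) are assumptions made here for clarity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def pool_time_freq(feats: np.ndarray) -> np.ndarray:
    """Mean- and max-pool a (time, dim) feature map into one vector."""
    return np.concatenate([feats.mean(axis=0), feats.max(axis=0)])

def precision_at_k(X: np.ndarray, y: np.ndarray, k: int = 5) -> float:
    """Mean fraction of each clip's k nearest neighbors sharing its label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)              # idx[:, 0] is the query itself
    return float((y[idx[:, 1:]] == y[:, None]).mean())

def permutation_lift(X, y, k=5, n_perm=20, seed=0):
    """Observed Precision@k divided by its mean under shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = precision_at_k(X, y, k)
    chance = np.mean([precision_at_k(X, rng.permutation(y), k)
                      for _ in range(n_perm)])
    return observed / chance

# Stub: 200 clips, each a (time=40, dim=512) "encoder" feature map.
rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=200)
feats = [rng.normal(loc=y[i], size=(40, 512)) for i in range(200)]
X = np.stack([pool_time_freq(f) for f in feats])
X = PCA(n_components=64).fit_transform(X)   # label-free PCA
print(f"P@5={precision_at_k(X, y):.3f}  lift={permutation_lift(X, y):.1f}x")
```

A lift near 1.0 means the embedding geometry is indistinguishable from chance under label permutation; a useful representation should push it well above that.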
Related papers
- Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion [0.0]
In bioacoustic classification, species identity may be inferred both from the acoustic signal and from context such as location and season. We introduce FINCH, an adaptive log-linear evidence fusion framework that integrates a pre-trained audio classifier with a structured spatiotemporal predictor. FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs.
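Log-linear evidence fusion has a compact generic form: posteriors multiply after exponent weighting. The sketch below is a fixed-weight illustration of that idea, not FINCH itself, whose weighting is adaptive; all names here are assumptions.

```python
import numpy as np

def log_linear_fuse(log_p_audio, log_p_context, w=0.5):
    """Fused posterior proportional to p_audio * p_context**w."""
    logits = log_p_audio + w * log_p_context
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

log_p_audio = np.log([0.6, 0.3, 0.1])           # classifier over 3 species
log_p_context = np.log([0.1, 0.2, 0.7])         # location/season prior
print(log_linear_fuse(log_p_audio, log_p_context))
```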
arXiv Detail & Related papers (2026-02-03T18:21:13Z)
- MARS-Sep: Multimodal-Aligned Reinforced Sound Separation [72.85468563236005]
MARS-Sep is a reinforcement learning framework for sound separation. It learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate. Experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation.
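The "clipped trust-region surrogate" is the PPO-style objective; combined with a factorized Beta policy (natural for masks in [0, 1]), it can be sketched as below. Shapes and parameters are illustrative assumptions, not MARS-Sep's implementation.

```python
import numpy as np
from scipy.stats import beta

def clipped_surrogate(masks, a_new, b_new, a_old, b_old, adv, eps=0.2):
    """Mean clipped importance-weighted advantage over mask bins."""
    logp_new = beta.logpdf(masks, a_new, b_new).sum(axis=-1)  # factorized
    logp_old = beta.logpdf(masks, a_old, b_old).sum(axis=-1)
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(unclipped, clipped).mean()   # maximize this

rng = np.random.default_rng(0)
masks = rng.beta(2.0, 2.0, size=(8, 16))   # 8 samples, 16 T-F bins
adv = rng.normal(size=8)                   # per-sample advantages
print(clipped_surrogate(masks, 2.1, 1.9, 2.0, 2.0, adv))
```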
arXiv Detail & Related papers (2025-10-12T09:05:28Z)
- Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification [8.07177858013243]
Evaluation of self-supervised learning in audio defaults to fine-tuning. We introduce binarized probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.
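A prototype-based probe of this flavor can be written in a few lines: one learned prototype per class, with patch-token similarities aggregated class-wise. This is a sketch under assumed shapes and a max-pooling choice, not the paper's exact binarized probes.

```python
import torch
import torch.nn as nn

class PrototypeProbe(nn.Module):
    def __init__(self, dim: int, n_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_patches, dim), from a frozen backbone
        sims = tokens @ self.prototypes.t()        # (batch, patches, classes)
        return sims.max(dim=1).values              # class-wise aggregation

probe = PrototypeProbe(dim=768, n_classes=10)
logits = probe(torch.randn(4, 196, 768))           # multi-label logits
print(logits.shape)                                # torch.Size([4, 10])
```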
arXiv Detail & Related papers (2025-09-29T15:11:18Z)
- PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning [17.302186298424836]
Cross-modal retrieval aims to align different modalities via semantic similarity. Existing methods often assume that image-text pairs are perfectly aligned, overlooking Noisy Correspondences in real data.
arXiv Detail & Related papers (2025-09-19T05:41:17Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
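An LLM-as-judge scorer of this kind reduces to prompting for a structured judgement and parsing it. The sketch below is a generic illustration, not CLAIR-A's exact prompt or scale; call_llm is a hypothetical stand-in for whatever chat-completion client is used.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real chat-completion call."""
    raise NotImplementedError

def judge_caption(candidate: str, reference: str) -> float:
    prompt = (
        "You are judging audio captions. Reference caption:\n"
        f"{reference}\n\nCandidate caption:\n{candidate}\n\n"
        'Reply with JSON: {"score": <0-100>, "reason": "<one sentence>"}'
    )
    reply = json.loads(call_llm(prompt))
    return float(reply["score"]) / 100.0
```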
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- Robust Online Classification: From Estimation to Denoising [14.535583931446807]
We study online classification of features into labels with general hypothesis classes.
Predictions are made using observed noisy labels and noiseless features.
The performance is measured via minimax risk when comparing against true labels.
arXiv Detail & Related papers (2023-09-04T16:17:39Z)
- Class Prototype-based Cleaner for Label Noise Learning [73.007001454085]
Semi-supervised learning methods are current SOTA solutions to the noisy-label learning problem.
We propose a simple yet effective solution, named Class Prototype-based label noise Cleaner.
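The idea of a class-prototype cleaner can be illustrated generically: build one prototype per class from current embeddings and keep samples whose given label agrees with the nearest prototype. The paper's cleaner is more involved; the cosine rule and names below are assumptions.

```python
import numpy as np

def clean_labels(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Boolean mask of samples judged clean (label == nearest prototype)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    protos = np.stack([Xn[y == c].mean(axis=0) for c in np.unique(y)])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    nearest = (Xn @ protos.T).argmax(axis=1)       # cosine similarity
    return nearest == y

rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=100)
X = rng.normal(size=(100, 32)) + np.eye(5)[y] @ rng.normal(size=(5, 32)) * 3
print(clean_labels(X, y).mean())                   # fraction kept as clean
```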
arXiv Detail & Related papers (2022-12-21T04:56:41Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also improves robustness, producing high-quality results in noisy environments.
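One standard way to embed high- and low-quality samples into a similar vector space is to penalize the distance between their embedding statistics. The mean-matching loss below is a generic stand-in for illustration, not the paper's specific domain-adaptation objective.

```python
import torch

def mean_alignment_loss(clean_emb: torch.Tensor, noisy_emb: torch.Tensor):
    """Squared distance between batch-mean embeddings of the two domains."""
    return (clean_emb.mean(dim=0) - noisy_emb.mean(dim=0)).pow(2).sum()

loss = mean_alignment_loss(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```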
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.