Related papers: Pretrained Conformers for Audio Fingerprinting and Retrieval

URL: http://arxiv.org/abs/2508.11609v2
Date: Thu, 11 Sep 2025 11:52:50 GMT
Title: Pretrained Conformers for Audio Fingerprinting and Retrieval
Authors: Kemal Altwlkany, Elmedin Selmanovic, Sead Delalic
Abstract summary: We train conformer-based encoders that are capable of generating unique embeddings for small segments of audio. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost completely immune to temporal misalignments and achieve state-of-the-art results in cases of other audio distortions such as noise, reverb or extreme temporal stretching. Code and models are made publicly available and the results are easy to reproduce as we train and test using popular and freely available datasets of different sizes.
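The contrastive setup described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact loss or retrieval procedure, so the InfoNCE-style loss, the temperature value, and the function names `info_nce_loss` and `retrieve` below are assumptions for illustration only, not the authors' implementation. Embeddings are plain lists of floats standing in for encoder outputs of short audio segments.

```python
import math


def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss (assumed, not taken from the paper).

    anchors[i] is the embedding of a clean segment, positives[i] the embedding
    of a distorted copy of the same segment; all other rows act as negatives.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def retrieve(query, database):
    """Return the index of the database embedding most similar to the query."""
    q = l2_normalize(query)
    sims = [sum(x * y for x, y in zip(q, l2_normalize(d))) for d in database]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy 2-D embeddings: two clean segments and lightly distorted copies of each.
clean = [[1.0, 0.0], [0.0, 1.0]]
distorted = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce_loss(clean, distorted)        # matching pairs: small loss
swapped = info_nce_loss(clean, distorted[::-1])  # mismatched pairs: large loss
match = retrieve([0.2, 0.8], clean)              # nearest clean segment
```

In this toy setup the loss is small when each distorted embedding stays close to its own clean segment and large when the pairs are swapped, which is the property a fingerprinting encoder is trained toward; retrieval then reduces to a nearest-neighbour lookup over stored embeddings.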

Related papers

Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z)
UniVerse-1: Unified Audio-Video Generation via Stitching of Experts [59.38012380516272]
We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique.
arXiv Detail & Related papers (2025-09-07T17:55:03Z)
Unleashing the Power of Natural Audio Featuring Multiple Sound Sources [54.38251699625379]
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio. We propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks. In experiments, ClearSep achieves state-of-the-art performance across multiple sound separation tasks.
arXiv Detail & Related papers (2025-04-24T17:58:21Z)
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations. These models are prone to generate audible artifacts when the conditioning is flawed or imperfect. We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample. The proposed two-stage method uses contrastive learning to pretrain the audio representation model. Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data [24.608764078208953]
Subgraphs are constructed by sampling the entire pool of available training data to exploit the relationship between labelled and unlabeled audio samples. We evaluate our model on three benchmark audio databases, and two tasks: acoustic event detection and speech emotion recognition. Our model is compact (240k parameters), and can produce generalized audio representations that are robust to different types of signal noise.
arXiv Detail & Related papers (2022-01-31T21:32:22Z)
Audiovisual transfer learning for audio tagging and sound event detection [21.574781022415372]
We study the merit of transfer learning for two sound recognition problems, i.e., audio tagging and sound event detection. We adapt a baseline system utilizing only spectral acoustic inputs to make use of pretrained auditory and visual features. We perform experiments with these modified models on an audiovisual multi-label data set.
arXiv Detail & Related papers (2021-06-09T21:55:05Z)
End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer). In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms. We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning [14.60531205031547]
We present a contrastive learning framework that derives from the segment-level search objective. In the segment-level search task, where the conventional audio fingerprinting systems used to fail, our system using 10x smaller storage has shown promising results.
arXiv Detail & Related papers (2020-10-22T17:44:40Z)
COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner. Our proposed generator is feed-forward and thus efficient for both training and inference. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.