Contrastive and Transfer Learning for Effective Audio Fingerprinting through a Real-World Evaluation Protocol
- URL: http://arxiv.org/abs/2507.06070v1
- Date: Tue, 08 Jul 2025 15:13:26 GMT
- Title: Contrastive and Transfer Learning for Effective Audio Fingerprinting through a Real-World Evaluation Protocol
- Authors: Christos Nikou, Theodoros Giannakopoulos
- Abstract summary: Recent advances in song identification leverage deep neural networks to learn compact audio fingerprints directly from raw waveforms. While these methods perform well under controlled conditions, their accuracy drops significantly in real-world scenarios where the audio is captured via mobile devices in noisy environments. We generate three recordings of the same audio, each with increasing levels of noise, captured using a mobile device's microphone. Our results reveal a substantial performance drop for two state-of-the-art CNN-based models under this protocol, compared to previously reported benchmarks.
- Score: 1.8842532732272859
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in song identification leverage deep neural networks to learn compact audio fingerprints directly from raw waveforms. While these methods perform well under controlled conditions, their accuracy drops significantly in real-world scenarios where the audio is captured via mobile devices in noisy environments. In this paper, we introduce a novel evaluation protocol designed to better reflect such real-world conditions. We generate three recordings of the same audio, each with increasing levels of noise, captured using a mobile device's microphone. Our results reveal a substantial performance drop for two state-of-the-art CNN-based models under this protocol, compared to previously reported benchmarks. Additionally, we highlight the critical role of the augmentation pipeline during training with contrastive loss. By introducing low-pass and high-pass filters in the augmentation pipeline, we significantly increase the performance of both systems in our proposed evaluation. Furthermore, we develop a transformer-based model with a tailored projection module and demonstrate that transferring knowledge from a semantically relevant domain yields a more robust solution. The transformer architecture outperforms CNN-based models across all noise levels and query durations. In low-noise conditions it achieves 47.99% for 1-second queries and 97% for 10-second queries in finding the correct song, surpassing the second-best performing model by 14% and 18.5%, respectively. Under heavy noise, it achieves a detection rate of 56.5% for 15-second queries. All experiments are conducted on a public large-scale dataset of over 100K songs, with queries matched against a database of 56 million vectors.
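As a rough illustration of the augmentation idea (not the authors' implementation; the cutoff ranges, sample rate, and pairing scheme below are assumptions), a contrastive pipeline might inject random low-pass/high-pass filtering like this:

```python
# Sketch of an augmentation step adding random low-pass / high-pass
# filtering to a contrastive fingerprinting pipeline. Cutoff ranges and
# probabilities are illustrative assumptions, not the paper's settings.
import numpy as np
from scipy.signal import butter, sosfiltfilt

SR = 8000  # assumed fingerprinting sample rate

def random_filter(wave: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random low-pass or high-pass Butterworth filter."""
    if rng.random() < 0.5:
        cutoff = rng.uniform(2000.0, 3500.0)   # Hz, low-pass band edge
        sos = butter(4, cutoff, btype="lowpass", fs=SR, output="sos")
    else:
        cutoff = rng.uniform(100.0, 500.0)     # Hz, high-pass band edge
        sos = butter(4, cutoff, btype="highpass", fs=SR, output="sos")
    return sosfiltfilt(sos, wave)

def make_positive_pair(wave, rng):
    # Two independently filtered "views" of the same clip serve as a
    # positive pair for a contrastive (e.g., NT-Xent) loss.
    return random_filter(wave, rng), random_filter(wave, rng)
```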
Related papers
- End-to-end Audio Deepfake Detection from RAW Waveforms: a RawNet-Based Approach with Cross-Dataset Evaluation [8.11594945165255]
We propose an end-to-end deep learning framework for audio deepfake detection that operates directly on raw waveforms. Our model, RawNetLite, is a lightweight convolutional-recurrent architecture designed to capture both spectral and temporal features without handcrafted preprocessing.
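RawNetLite's exact layout is not given here; the following is only a hedged sketch of the general convolutional-recurrent pattern on raw waveforms (all layer sizes are assumptions):

```python
# Generic convolutional-recurrent detector on raw waveforms, sketched
# after the RawNetLite description above; layer sizes are assumptions.
import torch
import torch.nn as nn

class ConvRecurrentDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # Strided 1-D convolutions learn a filterbank from raw audio.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=251, stride=10), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, stride=2), nn.ReLU(),
        )
        # A GRU models the temporal structure of the learned features.
        self.rnn = nn.GRU(64, 64, batch_first=True)
        self.head = nn.Linear(64, 1)  # real-vs-fake logit

    def forward(self, wave):                   # wave: (batch, samples)
        x = self.frontend(wave.unsqueeze(1))   # (batch, 64, frames)
        x, _ = self.rnn(x.transpose(1, 2))     # (batch, frames, 64)
        return self.head(x[:, -1])             # last-frame logit
```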
arXiv Detail & Related papers (2025-04-29T16:38:23Z) - Measuring the Robustness of Audio Deepfake Detectors [59.09338266364506]
This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions. Using both traditional deep learning models and state-of-the-art foundation models, we make four unique observations.
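One corruption of this kind, additive noise at a controlled SNR, can be sketched as follows (the SNR grid and the `detector`/`dataset` interfaces are illustrative assumptions, not the benchmark's code):

```python
# Sketch of one corruption from such a robustness protocol: additive
# white noise at a target SNR. The SNR grid is an assumed example.
import numpy as np

def add_noise_at_snr(clean: np.ndarray, snr_db: float, rng) -> np.ndarray:
    noise = rng.standard_normal(clean.shape)
    p_sig = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so 10*log10(p_sig / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Evaluate a detector's accuracy across increasing corruption severity.
# `detector` and `dataset` are placeholders for a real model / eval set.
def robustness_curve(detector, dataset, snrs=(20, 10, 0), seed=0):
    rng = np.random.default_rng(seed)
    return {snr: np.mean([detector(add_noise_at_snr(x, snr, rng)) == y
                          for x, y in dataset]) for snr in snrs}
```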
arXiv Detail & Related papers (2025-03-21T23:21:17Z) - Effects of Dataset Sampling Rate for Noise Cancellation through Deep Learning [1.024113475677323]
This research explores the use of deep neural networks (DNNs) as a superior alternative to traditional noise cancellation techniques.
The Conv-TasNet network was trained on datasets such as WHAM!, LibriMix, and the MS-2023 DNS Challenge.
Models trained at higher sampling rates (48 kHz) achieved much better Total Harmonic Distortion (THD) and Quality Prediction for Generative Neural Speech Codecs (WARP-Q) scores.
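For context, preparing the same material at two sampling rates is straightforward with polyphase resampling; this sketch assumes SciPy and is not tied to the paper's pipeline:

```python
# Minimal sketch of producing the same training clip at two sampling
# rates (48 kHz vs. 16 kHz) for a rate comparison; data-loading details
# are assumptions.
from math import gcd
from scipy.signal import resample_poly

def to_rate(wave, sr_in: int, sr_out: int):
    """Polyphase resampling between two sample rates."""
    g = gcd(sr_in, sr_out)
    return resample_poly(wave, sr_out // g, sr_in // g)

# e.g., a 48 kHz training clip downsampled for a 16 kHz comparison run:
# wave_16k = to_rate(wave_48k, 48000, 16000)
```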
arXiv Detail & Related papers (2024-05-30T16:20:44Z) - Collaborative Learning with a Drone Orchestrator [79.75113006257872]
A swarm of intelligent wireless devices trains a shared neural network model with the help of a drone.
The proposed framework achieves a significant speedup in training, leading to average savings of 24% and 87% in drone hovering time.
arXiv Detail & Related papers (2023-03-03T23:46:25Z) - Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
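The first stage, extracting a high-level wav2vec representation, can be reproduced with torchaudio's public wav2vec 2.0 bundle (the paper's exact checkpoint may differ):

```python
# Sketch of the first stage described above: pulling a high-level
# speech representation from a pre-trained wav2vec 2.0 model via
# torchaudio's public bundle; the paper's checkpoint may differ.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)  # stand-in 1-second clip
with torch.inference_mode():
    features, _ = model.extract_features(waveform)
# `features` is a list of per-layer tensors, (batch, frames, 768) each,
# which a downstream fake-audio classifier (here, light-DARTS) consumes.
```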
arXiv Detail & Related papers (2022-08-20T06:46:55Z) - Deep Learning-Based Acoustic Mosquito Detection in Noisy Conditions Using Trainable Kernels and Augmentations [17.77602155559703]
We demonstrate a unique recipe to enhance the effectiveness of audio machine learning approaches by fusing pre-processing techniques into a deep learning model.
Our solution accelerates training and inference by optimizing hyperparameters through training instead of costly random searches, building a reliable mosquito detector from audio signals.
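A minimal sketch of the trainable-kernel idea, assuming a learned 1-D convolutional front end in place of fixed spectrogram preprocessing (filter count and sizes are illustrative):

```python
# Sketch of "trainable kernels": a 1-D convolutional front end that
# replaces fixed preprocessing and is learned jointly with the
# classifier. Kernel count, size, and stride are assumptions.
import torch.nn as nn

class TrainableFrontEnd(nn.Module):
    def __init__(self, n_filters=64, kernel_size=401, stride=160):
        super().__init__()
        # Learned during training, so the "preprocessing" adapts to the
        # task instead of being tuned by hand or by random search.
        self.filters = nn.Conv1d(1, n_filters, kernel_size, stride=stride)

    def forward(self, wave):                          # (batch, samples)
        return self.filters(wave.unsqueeze(1)).abs()  # (batch, filters, frames)
```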
arXiv Detail & Related papers (2022-07-28T01:05:40Z) - Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z) - Practical Deep Raw Image Denoising on Mobile Devices [30.3578422624862]
We propose a lightweight, efficient neural network-based raw image denoiser that runs smoothly on mainstream mobile devices.
Our proposed mobile-friendly denoising model runs at 70 milliseconds per megapixel on Qualcomm Snapdragon 855 chipset.
arXiv Detail & Related papers (2020-10-14T10:30:32Z) - Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other.
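torchaudio ships a Conformer encoder implementation; a small usage sketch (demo hyperparameters, not the paper's LibriSpeech configuration) looks like:

```python
# Usage sketch of a Conformer encoder via torchaudio's implementation.
# Hyperparameters below are small demo values, not the paper's setup.
import torch
from torchaudio.models import Conformer

encoder = Conformer(input_dim=80, num_heads=4, ffn_dim=128,
                    num_layers=4, depthwise_conv_kernel_size=31)

feats = torch.randn(2, 200, 80)             # (batch, frames, mel bins)
lengths = torch.tensor([200, 150])          # valid frames per utterance
out, out_lengths = encoder(feats, lengths)  # (2, 200, 80) encoded frames
```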
arXiv Detail & Related papers (2020-05-16T20:56:25Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z) - End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
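Segment-restricted self-attention amounts to a block-diagonal attention mask; a hedged sketch follows (the segment size is an illustrative assumption, and this is not the paper's code):

```python
# Sketch of segment-restricted self-attention: a block-diagonal mask
# limits each frame's attention to its own segment, cutting the O(T^2)
# cost. Segment size is an illustrative assumption.
import torch

def segment_attention_mask(seq_len: int, segment: int) -> torch.Tensor:
    """True where attention is *blocked* (PyTorch mask convention)."""
    idx = torch.arange(seq_len)
    same = (idx.unsqueeze(0) // segment) == (idx.unsqueeze(1) // segment)
    return ~same

mask = segment_attention_mask(seq_len=8, segment=4)
# Pass as `attn_mask` to torch.nn.MultiheadAttention: frames 0-3 attend
# only among themselves, likewise frames 4-7.
```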
arXiv Detail & Related papers (2020-02-10T16:29:26Z)