Neural Audio Fingerprint for High-specific Audio Retrieval based on
Contrastive Learning
- URL: http://arxiv.org/abs/2010.11910v4
- Date: Wed, 10 Feb 2021 08:23:59 GMT
- Title: Neural Audio Fingerprint for High-specific Audio Retrieval based on
Contrastive Learning
- Authors: Sungkyun Chang, Donmoon Lee, Jeongsoo Park, Hyungui Lim, Kyogu Lee,
Karam Ko, Yoonchang Han
- Abstract summary: We present a contrastive learning framework derived from the segment-level search objective.
On the segment-level search task, where conventional audio fingerprinting systems typically fail, our system shows promising results while using 10x smaller storage.
- Score: 14.60531205031547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing audio fingerprinting systems have limitations that prevent their
use for high-specific audio retrieval at scale. In this work, we generate a
low-dimensional representation from a short unit segment of audio and couple
this fingerprint with a fast maximum inner-product search. To this end, we
present a contrastive learning framework derived from the segment-level
search objective. Each training update uses a batch consisting of a set of
pseudo labels, randomly selected original samples, and their augmented
replicas. These replicas simulate the degradation of the original audio
signals by applying small time offsets and various types of distortion, such
as background noise and room/microphone impulse responses. On the segment-level
search task, where conventional audio fingerprinting systems typically fail,
our system shows promising results while using 10x smaller storage. Our code and
dataset are available at https://mimbres.github.io/neural-audio-fp/.
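As a rough illustration of how the degraded replicas might be produced, the hypothetical NumPy function below applies two of the degradations the abstract names: a small random time offset of the crop window and additive background noise at a random SNR. The 8 kHz sample rate, 1-second unit segment, and offset/SNR ranges are assumptions made for this sketch, not the paper's exact settings.

```python
import numpy as np

def make_replica(x, sr=8000, max_offset_s=0.2, snr_db=(0.0, 10.0), rng=None):
    """Cut a 1-second unit segment with a small random time offset and add
    background noise at a random SNR. White noise stands in for real noise
    recordings here; a room/microphone impulse response would be applied as
    an additional convolution, e.g. np.convolve(seg, ir)[:sr].
    x is assumed to be mono audio at least (1 + max_offset_s) seconds long."""
    rng = rng or np.random.default_rng()
    shift = int(rng.integers(0, int(max_offset_s * sr)))   # small time offset
    seg = x[shift: shift + sr].astype(np.float64)
    noise = rng.standard_normal(sr)
    snr = rng.uniform(*snr_db)                             # target SNR in dB
    # scale the noise so that 10*log10(P_signal / P_noise) == snr
    gain = np.sqrt((seg ** 2).mean() / ((noise ** 2).mean() * 10 ** (snr / 10)))
    return seg + gain * noise
```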
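The training scheme the abstract describes, where a batch pairs original unit segments with their degraded replicas and each pair serves as the only positive, is essentially an NT-Xent-style contrastive objective. The following is a minimal PyTorch sketch of that idea; the 128-dim fingerprints, batch size, and temperature are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_orig, z_aug, tau=0.05):
    """NT-Xent-style loss over N original segments and their N replicas;
    each (original, replica) pair is the positive, and every other
    fingerprint in the batch acts as a negative."""
    z_orig = F.normalize(z_orig, dim=1)   # unit norm, so inner product
    z_aug = F.normalize(z_aug, dim=1)     # equals cosine similarity
    z = torch.cat([z_orig, z_aug], dim=0)              # (2N, d)
    sim = z @ z.T / tau                                # pairwise logits
    n = z_orig.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    # the positive for row i is its counterpart at i + n (and vice versa)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# toy usage: 128-dim fingerprints for a batch of 120 segment pairs
loss = contrastive_loss(torch.randn(120, 128), torch.randn(120, 128))
```

Normalizing the fingerprints to unit length ties training to retrieval: the cosine similarity optimized here is the same inner product the search stage maximizes.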
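Retrieval then reduces to maximum inner-product search over the stored fingerprint matrix, with consecutive query segments combined at the sequence level. The brute-force sketch below (an exhaustive matmul plus a simple offset-voting step) only illustrates the search objective; a production system would use an approximate index instead of exhaustive search, and the offset histogram is a generic stand-in, not necessarily the paper's exact candidate-refinement procedure.

```python
import numpy as np

def mips_topk(db, q, k=20):
    """Exhaustive maximum inner-product search.
    db: (M, d) unit-norm fingerprints; q: (d,) query fingerprint."""
    scores = db @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

def segment_level_match(db, q_seq, k=20):
    """Segment-level search for a query of consecutive fingerprints:
    a hit at database index j for query segment i votes for alignment
    offset j - i, and the most-voted offset locates the match."""
    votes = {}
    for i, q in enumerate(q_seq):
        top, _ = mips_topk(db, q, k)
        for j in top:
            votes[int(j) - i] = votes.get(int(j) - i, 0) + 1
    return max(votes, key=votes.get)
```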
Related papers
- Language-based Audio Moment Retrieval [14.227865973426843]
We propose and design a new task called audio moment retrieval (AMR).
Unlike conventional language-based audio retrieval tasks, AMR aims to predict relevant moments in untrimmed long audio based on a text query.
We build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations.
We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks.
arXiv Detail & Related papers (2024-09-24T02:24:48Z)
- Music Augmentation and Denoising For Peak-Based Audio Fingerprinting [0.0]
We introduce and release a new audio augmentation pipeline that adds noise to music snippets in a realistic way.
We then propose and release a deep learning model that removes noisy components from spectrograms.
We show that the addition of our model improves the identification performance of commonly used audio fingerprinting systems, even under noisy conditions.
arXiv Detail & Related papers (2023-10-20T09:56:22Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- Play It Back: Iterative Attention for Audio Recognition [104.628661890361]
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time.
We propose an end-to-end attention-based architecture that through selective repetition attends over the most discriminative sounds.
We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks.
arXiv Detail & Related papers (2022-10-20T15:03:22Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection [0.0]
We present a novel approach called You Only Hear Once (YOHO).
We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification.
YOHO obtained a higher F-measure and lower error rate than the state-of-the-art Convolutional Recurrent Neural Network.
arXiv Detail & Related papers (2021-09-01T12:50:16Z)