Introducing Auxiliary Text Query-modifier to Content-based Audio
Retrieval
- URL: http://arxiv.org/abs/2207.09732v1
- Date: Wed, 20 Jul 2022 08:19:54 GMT
- Title: Introducing Auxiliary Text Query-modifier to Content-based Audio
Retrieval
- Authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio
Kashino
- Abstract summary: The amount of audio data available on public websites is growing rapidly.
We propose a content-based audio retrieval method that can retrieve a target audio that is similar to but slightly different from the query audio.
- Score: 37.02112904035811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The amount of audio data available on public websites is growing rapidly, and
an efficient mechanism for accessing the desired data is necessary. We propose
a content-based audio retrieval method that can retrieve a target audio that is
similar to but slightly different from the query audio by introducing auxiliary
textual information which describes the difference between the query and target
audio. While the range of conventional content-based audio retrieval is limited
to audio that is similar to the query audio, the proposed method can adjust the
retrieval range by adding an embedding of the auxiliary text query-modifier to
the embedding of the query sample audio in a shared latent space. To evaluate
our method, we built a dataset comprising two different audio clips and the
text that describes the difference. The experimental results show that the
proposed method retrieves the paired audio more accurately than the baseline.
We also confirmed based on visualization that the proposed method obtains the
shared latent space in which the audio difference and the corresponding text
are represented as similar embedding vectors.
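The core operation described in the abstract (shifting the query-audio embedding by the auxiliary text-modifier embedding in the shared latent space, then searching for the nearest target) can be sketched as follows. The vector addition is the mechanism the abstract states; the cosine-similarity ranking and all function names are illustrative assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_audio_emb, text_modifier_emb, candidate_embs):
    # Shift the query point by the text-modifier embedding in the
    # shared latent space, as described in the abstract.
    modified = [a + t for a, t in zip(query_audio_emb, text_modifier_emb)]
    # Rank candidate audio embeddings by similarity to the shifted query
    # and return the index of the best match.
    scores = [(i, cosine(modified, c)) for i, c in enumerate(candidate_embs)]
    return max(scores, key=lambda s: s[1])[0]
```

In this toy setup, a query near one candidate can be steered toward a different candidate purely by the text-modifier vector, which is the behavior the paper's visualization experiments examine.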
Related papers
- Audio Captioning via Generative Pair-to-Pair Retrieval with Refined Knowledge Base [0.0]
Retrieval-Augmented Generation (RAG) retrieves audio-text pairs from a knowledge base and augments them with query audio to generate accurate textual responses.
We propose generative pair-to-pair retrieval, which uses the generated caption as a text query to accurately find relevant audio-text pairs.
Our approach achieves state-of-the-art results on benchmarks including AudioCaps, Clotho, and Auto-ACD.
arXiv Detail & Related papers (2024-10-14T04:57:32Z)
- Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval [3.997809845676912]
Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics.
This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals.
arXiv Detail & Related papers (2024-06-22T17:19:51Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Audio Difference Learning for Audio Captioning [44.55621877667949]
This study introduces a novel training paradigm, audio difference learning, for improving audio captioning.
In the experiments using the Clotho and ESC50 datasets, the proposed method demonstrated an improvement in the SPIDEr score by 7% compared to conventional methods.
arXiv Detail & Related papers (2023-09-15T04:11:37Z)
- Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
arXiv Detail & Related papers (2023-09-07T17:45:58Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges Award in Task 6A of the DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Audio Retrieval with WavText5K and CLAP Training [8.362098382773265]
We propose a new collection of about five thousand web audio-text pairs that we refer to as WavText5K.
When used to train our retrieval system, WavText5K improved performance more than other audio captioning datasets.
Our framework learns to connect language and audio content by using a text encoder, two audio encoders, and a contrastive learning objective.
arXiv Detail & Related papers (2022-09-28T17:39:26Z)
- Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method.
Our proposed system achieves a significant improvement on bidirectional audio-text retrieval across all metrics, including recall and median and mean rank.
arXiv Detail & Related papers (2022-03-25T13:41:17Z)
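Several of the retrieval papers above report recall at k together with median and mean rank. A minimal sketch of how these standard metrics are computed from per-query ranks is shown below; the function name and metric keys are illustrative, not taken from any of the listed papers.

```python
def retrieval_metrics(ranks, ks=(1, 5, 10)):
    # ranks: 1-based rank of the ground-truth item for each query.
    n = len(ranks)
    # Recall@k: fraction of queries whose correct item ranks within the top k.
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    # Median rank (lower is better).
    s = sorted(ranks)
    mid = n // 2
    metrics["medR"] = float(s[mid]) if n % 2 else (s[mid - 1] + s[mid]) / 2
    # Mean rank (lower is better).
    metrics["meanR"] = sum(ranks) / n
    return metrics
```

For example, ranks of [1, 3, 7, 20] over four queries give R@1 = 0.25, R@10 = 0.75, and a median rank of 5.0.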
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.