A Transformer-based Audio Captioning Model with Keyword Estimation
- URL: http://arxiv.org/abs/2007.00222v2
- Date: Sat, 8 Aug 2020 06:38:00 GMT
- Title: A Transformer-based Audio Captioning Model with Keyword Estimation
- Authors: Yuma Koizumi, Ryo Masumura, Kyosuke Nishida, Masahiro Yasuda,
Shoichiro Saito
- Abstract summary: One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene.
We propose a Transformer-based audio-captioning model with keyword estimation called TRACKE.
It simultaneously solves the word-selection indeterminacy problem with the main task of AAC while executing the sub-task of acoustic event detection/acoustic scene classification.
- Score: 36.507981376481354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the problems with automated audio captioning (AAC) is the
indeterminacy in word selection corresponding to the audio event/scene. Since
one acoustic event/scene can be described with several words, it results in a
combinatorial explosion of possible captions and difficulty in training. To
solve this problem, we propose a Transformer-based audio-captioning model with
keyword estimation called TRACKE. It simultaneously solves the word-selection
indeterminacy problem with the main task of AAC while executing the sub-task of
acoustic event detection/acoustic scene classification (i.e., keyword
estimation). TRACKE estimates keywords, which comprise a word set corresponding
to audio events/scenes in the input audio, and generates the caption while
referring to the estimated keywords to reduce word-selection indeterminacy.
Experimental results on a public AAC dataset indicate that TRACKE achieved
state-of-the-art performance and successfully estimated both the caption and
its keywords.
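As a rough illustration of the approach the abstract describes, the sketch below (an assumption for illustration, not the authors' released code) wires a shared Transformer encoder to a multi-label keyword-estimation head and lets the caption decoder attend both to the encoded audio and to embeddings of the top estimated keywords; all module names, dimensions, and the exact conditioning scheme are illustrative choices.

import torch
import torch.nn as nn

class KeywordConditionedCaptioner(nn.Module):
    def __init__(self, n_mels=64, d_model=256, vocab_size=5000, n_keywords=300):
        super().__init__()
        self.frontend = nn.Linear(n_mels, d_model)             # project log-mel frames
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.keyword_head = nn.Linear(d_model, n_keywords)     # multi-label keyword logits
        self.keyword_emb = nn.Embedding(n_keywords, d_model)   # embeds estimated keywords
        self.token_emb = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, caption_in, k=5):
        # mel: (B, T, n_mels) log-mel frames; caption_in: (B, L) token ids (teacher forcing)
        h = self.encoder(self.frontend(mel))                   # (B, T, d_model)
        kw_logits = self.keyword_head(h.mean(dim=1))           # (B, n_keywords)
        top_kw = kw_logits.topk(k, dim=-1).indices             # ids of the k most likely keywords
        memory = torch.cat([h, self.keyword_emb(top_kw)], dim=1)  # decoder sees audio + keywords
        L = caption_in.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        dec = self.decoder(self.token_emb(caption_in), memory, tgt_mask=causal)
        return self.out(dec), kw_logits                        # caption logits and keyword logits

A training objective would then pair cross-entropy on the caption logits with binary cross-entropy on the keyword logits, mirroring the main-task/sub-task framing above.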
Related papers
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
arXiv Detail & Related papers (2023-09-07T17:45:58Z) - Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award in Task 6A of the DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z) - Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z) - CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled
Videos [44.14061539284888]
We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting.
arXiv Detail & Related papers (2022-12-14T07:21:45Z) - Interactive Audio-text Representation for Automated Audio Captioning
with Contrastive Learning [25.06635361326706]
We propose a novel AAC system called CLIP-AAC to learn interactive cross-modality representation.
The proposed CLIP-AAC introduces an audio-head and a text-head in the pre-trained encoder to extract audio-text information.
We also apply contrastive learning to narrow the domain difference by learning the correspondence between the audio signal and its paired captions (a minimal sketch of such a contrastive objective appears after this list).
arXiv Detail & Related papers (2022-03-29T13:06:46Z) - Evaluating Off-the-Shelf Machine Listening and Natural Language Models
for Automated Audio Captioning [16.977616651315234]
A captioning system has to identify various kinds of information in the input signal and express them in natural language.
We evaluate the performance of off-the-shelf models with a Transformer-based captioning approach.
arXiv Detail & Related papers (2021-10-14T14:42:38Z) - Using multiple reference audios and style embedding constraints for
speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z) - Acoustic Word Embedding System for Code-Switching Query-by-example
Spoken Term Detection [17.54377669932433]
We propose a deep convolutional neural network-based acoustic word embedding system for code-switching query-by-example spoken term detection.
We combine audio data in two languages for training instead of using only a single language.
arXiv Detail & Related papers (2020-05-24T15:27:56Z)
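The CLIP-AAC entry in the list above mentions learning the correspondence between an audio signal and its paired captions through contrastive learning. The sketch below (an illustrative assumption, not that paper's code) shows one common way to realise such an objective as a symmetric InfoNCE loss over a batch of audio and caption embeddings; the encoders producing the embeddings and the temperature value are assumed.

import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # audio_emb, text_emb: (B, D) embeddings of paired audio clips and captions
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)   # matches lie on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)                          # audio -> caption direction
    loss_t2a = F.cross_entropy(logits.t(), targets)                      # caption -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)

Minimising this loss pulls each clip towards its own caption and away from the other captions in the batch, which is the correspondence-learning step that entry describes.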
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.