Audio Difference Learning for Audio Captioning
- URL: http://arxiv.org/abs/2309.08141v1
- Date: Fri, 15 Sep 2023 04:11:37 GMT
- Title: Audio Difference Learning for Audio Captioning
- Authors: Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda
- Abstract summary: This study introduces a novel training paradigm, audio difference learning, for improving audio captioning.
In the experiments using the Clotho and ESC50 datasets, the proposed method demonstrated an improvement in the SPIDEr score by 7% compared to conventional methods.
- Score: 44.55621877667949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study introduces a novel training paradigm, audio difference learning,
for improving audio captioning. The fundamental concept of the proposed
learning method is to create a feature representation space that preserves the
relationships between audio clips, enabling the generation of captions that detail
intricate audio information. This method employs a reference audio along with
the input audio, both of which are transformed into feature representations via
a shared encoder. Captions are then generated from these differential features
to describe their differences. Furthermore, a unique technique is proposed that
involves mixing the input audio with additional audio, and using the additional
audio as a reference. The difference between the mixed audio and the reference
audio then reverts to the original input audio, so the original input's caption
can be used as the caption for this difference,
eliminating the need for additional annotations for the differences. In the
experiments using the Clotho and ESC50 datasets, the proposed method
demonstrated an improvement in the SPIDEr score by 7% compared to conventional
methods.
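
As a rough illustration of the mixing-based training scheme described in the abstract, below is a minimal PyTorch-style sketch. The DifferenceCaptioner class, its encoder/decoder modules, and the simple additive mixing are hypothetical placeholders chosen for clarity, not the authors' implementation.

```python
# Minimal sketch of audio difference learning (assumed interfaces, not the paper's code).
import torch
import torch.nn as nn


class DifferenceCaptioner(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # shared encoder applied to both audio inputs
        self.decoder = decoder  # caption decoder conditioned on a feature sequence

    def forward(self, input_audio, reference_audio, caption_tokens):
        # Mix the input audio with the additional (reference) audio.
        # A plain sum is used here for simplicity; the paper may weight the mix.
        mixed = input_audio + reference_audio

        # Encode the mixed audio and the reference audio with the shared encoder.
        z_mixed = self.encoder(mixed)
        z_ref = self.encoder(reference_audio)

        # The differential feature should revert to a representation of the
        # original input audio, so the original input's caption can supervise it
        # directly, with no extra annotations for the difference.
        z_diff = z_mixed - z_ref
        return self.decoder(z_diff, caption_tokens)  # e.g. token-level logits for cross-entropy
```

Under these assumptions, a training step would compute a standard captioning loss (e.g. cross-entropy against the original input's caption tokens) on the decoder output, exactly as in conventional audio captioning, with only the input construction changed.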
Related papers
- Audio Difference Captioning Utilizing Similarity-Discrepancy
Disentanglement [22.924746293106715]
Audio difference captioning (ADC) addresses the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe differences in content.
We also propose a cross-attention-concentrated transformer encoder to extract differences by comparing a pair of audio clips and a similarity-discrepancy disentanglement to emphasize the difference in the latent space.
Experiments on the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively, and visualizing the attention weights in the transformer encoder confirmed that they better capture the difference.
arXiv Detail & Related papers (2023-08-23T05:13:25Z) - AKVSR: Audio Knowledge Empowered Visual Speech Recognition by
Compressing Audio Knowledge of a Pretrained Model [53.492751392755636]
We propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality.
We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used LRS3 dataset.
arXiv Detail & Related papers (2023-08-15T06:38:38Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award in Task 6A of the DCASE 2022 Challenge.
arXiv Detail & Related papers (2023-04-06T07:58:27Z) - Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Caption Feature Space Regularization for Audio Captioning [24.40864471466915]
Audio captioning models generally handle one-to-many training by randomly selecting one of the correlated captions as the ground truth for each audio clip.
We propose a two-stage framework for audio captioning: (i) in the first stage, via contrastive learning, we construct a proxy feature space that reduces the distances between captions correlated to the same audio, and (ii) in the second stage, the proxy feature space is used as additional supervision to encourage the model to be optimized in a direction that benefits all the correlated captions.
arXiv Detail & Related papers (2022-04-18T17:07:31Z) - Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings.
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z) - Audio Captioning using Gated Recurrent Units [1.3960152426268766]
The VGGish audio embedding model is used to explore the usability of audio embeddings for the audio captioning task.
The proposed architecture encodes audio and text input modalities separately and combines them before the decoding stage.
Our experimental results show that the proposed BiGRU-based deep model outperforms the state of the art.
arXiv Detail & Related papers (2020-06-05T12:03:12Z)