Caption Feature Space Regularization for Audio Captioning
- URL: http://arxiv.org/abs/2204.08409v1
- Date: Mon, 18 Apr 2022 17:07:31 GMT
- Title: Caption Feature Space Regularization for Audio Captioning
- Authors: Yiming Zhang, Hong Yu, Ruoyi Du, Zhanyu Ma, Yuan Dong
- Abstract summary: General audio captioning models achieve one-to-many training by randomly selecting a correlated caption as the ground truth for each audio.
We propose a two-stage framework for audio captioning: (i) in the first stage, via contrastive learning, we construct a proxy feature space to reduce the distances between captions correlated to the same audio, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to encourage the model to be optimized in the direction that benefits all the correlated captions.
- Score: 24.40864471466915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio captioning aims at describing the content of audio clips with human
language. Due to the ambiguity of audio, different people may perceive the same
audio differently, resulting in caption disparities (i.e., one audio may
correlate to several captions with diverse semantics). To handle this, general audio
captioning models achieve one-to-many training by randomly selecting a
correlated caption as the ground truth for each audio. However, this leads to
significant variation in the optimization directions and weakens model
stability. To eliminate this negative effect, in this paper, we propose a
two-stage framework for audio captioning: (i) in the first stage, via
contrastive learning, we construct a proxy feature space to reduce the
distances between captions correlated to the same audio, and (ii) in the second
stage, the proxy feature space is utilized as additional supervision to
encourage the model to be optimized in the direction that benefits all the
correlated captions. We conducted extensive experiments on two datasets using
four commonly used encoder and decoder architectures. Experimental results
demonstrate the effectiveness of the proposed method. The code is available at
https://github.com/PRIS-CV/Caption-Feature-Space-Regularization.
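The two-stage recipe described in the abstract can be sketched compactly. The snippet below is a minimal PyTorch-style illustration under stated assumptions, not the authors' released implementation (that lives in the repository above): it assumes an InfoNCE-style contrastive objective for building the proxy caption space in stage one and a cosine-distance regularizer with an illustrative weight lambda_reg in stage two; the function names, the sentence-level features, and the hyperparameter values are assumptions made for the sketch.

```python
# Illustrative sketch of the two-stage idea (not the authors' exact code).
import torch
import torch.nn.functional as F


def stage1_contrastive_loss(caption_emb: torch.Tensor,
                            audio_ids: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Stage 1: InfoNCE-style loss that pulls together caption embeddings
    annotated for the same audio clip and pushes apart the rest.

    caption_emb: (B, D) embeddings from a trainable text encoder.
    audio_ids:   (B,) id of the audio clip each caption belongs to.
    """
    z = F.normalize(caption_emb, dim=-1)
    sim = z @ z.t() / temperature                    # (B, B) scaled cosine similarity
    pos_mask = audio_ids[:, None] == audio_ids[None, :]
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = pos_mask & ~self_mask                 # positives: same audio, not self
    sim = sim.masked_fill(self_mask, float("-inf"))  # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=-1, keepdim=True)
    pos_count = pos_mask.sum(dim=-1).clamp(min=1)
    # average log-probability of the positives for each anchor caption
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=-1) / pos_count
    return loss[pos_mask.any(dim=-1)].mean()


def stage2_total_loss(ce_loss: torch.Tensor,
                      model_caption_feat: torch.Tensor,
                      proxy_caption_feat: torch.Tensor,
                      lambda_reg: float = 1.0) -> torch.Tensor:
    """Stage 2: the usual cross-entropy captioning loss plus a regularizer
    that pulls the captioning model's sentence-level feature toward the
    frozen proxy embedding of its sampled ground-truth caption.
    """
    reg = 1.0 - F.cosine_similarity(model_caption_feat,
                                    proxy_caption_feat.detach(), dim=-1)
    return ce_loss + lambda_reg * reg.mean()
```

In this reading, the stage-one text encoder is trained first and then frozen, so the stage-two regularizer nudges the captioning model toward a region where all captions of the same audio already lie close together, rather than toward whichever single caption happened to be sampled as the ground truth.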
Related papers
- Audio Difference Learning for Audio Captioning [44.55621877667949]
This study introduces a novel training paradigm, audio difference learning, for improving audio captioning.
In the experiments using the Clotho and ESC50 datasets, the proposed method demonstrated an improvement in the SPIDEr score by 7% compared to conventional methods.
arXiv Detail & Related papers (2023-09-15T04:11:37Z)
- Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
arXiv Detail & Related papers (2023-09-07T17:45:58Z)
- Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement [22.924746293106715]
The proposed audio difference captioning (ADC) task addresses a problem with conventional audio captioning, which sometimes generates similar captions for similar audio clips and fails to describe the difference in their content.
We also propose a cross-attention-concentrated transformer encoder to extract differences by comparing a pair of audio clips and a similarity-discrepancy disentanglement to emphasize the difference in the latent space.
Experiments on the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively and improve the attention weights for extracting the difference, as shown by visualizing them in the transformer encoder.
arXiv Detail & Related papers (2023-08-23T05:13:25Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Towards Generating Diverse Audio Captions via Adversarial Training [33.76154801580643]
We propose a conditional generative adversarial network (C-GAN) to improve diversity of audio captioning systems.
A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model.
The results show that our proposed model can generate captions with better diversity as compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-12-05T05:06:19Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help describe ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings.
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.