Towards Generating Diverse Audio Captions via Adversarial Training
- URL: http://arxiv.org/abs/2212.02033v2
- Date: Fri, 28 Jun 2024 23:18:47 GMT
- Title: Towards Generating Diverse Audio Captions via Adversarial Training
- Authors: Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang,
- Abstract summary: We propose a conditional generative adversarial network (C-GAN) to improve diversity of audio captioning systems.
A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model used to generate captions.
The results show that our proposed model can generate captions with better diversity as compared to state-of-the-art methods.
- Score: 33.76154801580643
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips, however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe an audio clip diversely from various aspects using distinct words and grammar. We believe that an audio captioning system should have the ability to generate diverse captions, either for a fixed audio clip, or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model used to generate captions, and the hybrid discriminators assess the generated captions from different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity as compared to state-of-the-art methods.
Related papers
- It's Just Another Day: Unique Video Captioning by Discriminative Prompting [70.99367779336256]
Given multiple clips with the same caption, we generate a new caption for each clip that uniquely identifies it.
We demonstrate that captions generated by CDP improve text-to-video R@1 by 15% for egocentric videos and 10% in timeloop movies.
arXiv Detail & Related papers (2024-10-15T15:41:49Z) - An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment [6.977241620071544]
Multimodal large language models have fueled progress in image captioning.
In this work, we show that this ability can be re-purposed for audio captioning.
We introduce a novel methodology for bridging the audiovisual modality gap.
arXiv Detail & Related papers (2024-10-08T12:52:48Z) - Improving Text-To-Audio Models with Synthetic Captions [51.19111942748637]
We propose an audio captioning pipeline that uses an textitaudio language model to synthesize accurate and diverse captions for audio at scale.
We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named textttAF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions.
arXiv Detail & Related papers (2024-06-18T00:02:15Z) - Zero-shot audio captioning with audio-language model guidance and audio
context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z) - Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z) - Caption Feature Space Regularization for Audio Captioning [24.40864471466915]
General audio captioning models achieve the one-to-many training by randomly selecting a correlated caption as the ground truth for each audio.
We propose a two-stage framework for audio captioning: (i) in the first stage, via the contrastive learning, we construct a proxy feature space to reduce the distances between captions correlated to the same audio, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to encourage the model to be optimized in the direction that benefits all the correlated captions.
arXiv Detail & Related papers (2022-04-18T17:07:31Z) - Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes video semantic representation as an input, and conditionally modulates the gates and cells of long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
arXiv Detail & Related papers (2021-12-02T09:24:45Z) - Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce a new task of Syntax Customized Video Captioning (SCVC)
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z) - Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use the log Mel energy features, VGGish embeddings, and a pretrained audio neural network (PANN) embeddings.
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z) - Multi-task Regularization Based on Infrequent Classes for Audio
Captioning [19.50869817974852]
A significant challenge for audio captioning is the distribution of words in the captions.
In this paper we propose two methods to mitigate this class imbalance problem.
We evaluate our method using Clotho, a recently published, wide-scale audio captioning dataset.
arXiv Detail & Related papers (2020-07-09T09:38:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.