Multi-task Regularization Based on Infrequent Classes for Audio
Captioning
- URL: http://arxiv.org/abs/2007.04660v1
- Date: Thu, 9 Jul 2020 09:38:54 GMT
- Title: Multi-task Regularization Based on Infrequent Classes for Audio
Captioning
- Authors: Emre Çakır, Konstantinos Drossos, and Tuomas Virtanen
- Abstract summary: A significant challenge for audio captioning is the distribution of words in the captions.
In this paper we propose two methods to mitigate this class imbalance problem.
We evaluate our method using Clotho, a recently published, wide-scale audio captioning dataset.
- Score: 19.50869817974852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio captioning is a multi-modal task, focusing on using natural language
for describing the contents of general audio. Most audio captioning methods are
based on deep neural networks, employing an encoder-decoder scheme and a
dataset with audio clips and corresponding natural language descriptions (i.e.
captions). A significant challenge for audio captioning is the distribution of
words in the captions: some words are very frequent but acoustically
non-informative, i.e. the function words (e.g. "a", "the"), and other words are
infrequent but informative, i.e. the content words (e.g. adjectives, nouns). In
this paper we propose two methods to mitigate this class imbalance problem.
First, in an autoencoder setting for audio captioning, we weigh each word's
contribution to the training loss inversely proportional to its number of
occurrences in the whole dataset. Secondly, in addition to multi-class,
word-level audio captioning task, we define a multi-label side task based on
clip-level content word detection by training a separate decoder. We use the
loss from the second task to regularize the jointly trained encoder for the
audio captioning task. We evaluate our method using Clotho, a recently
published, wide-scale audio captioning dataset, and our results show an
increase of 37\% relative improvement with SPIDEr metric over the baseline
method.
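The abstract describes two mechanisms: weighting the word-level loss in inverse proportion to word frequency, and a clip-level multi-label content-word side task whose loss regularizes the shared encoder. The following is a minimal PyTorch sketch of how such losses can be combined; the GRU encoder, mean pooling, frame-aligned word targets, and the weighting factor `lambda_reg` are illustrative assumptions, not the architecture or hyperparameters used in the paper.

```python
import torch
import torch.nn as nn

# --- Inverse-frequency word weights (first method) --------------------------
# word_counts[i] = occurrences of vocabulary word i across all training captions.
word_counts = torch.tensor([50000.0, 1200.0, 35.0, 7.0])      # toy 4-word vocabulary
word_weights = 1.0 / word_counts
word_weights = word_weights / word_weights.sum() * len(word_counts)  # renormalize around 1

# Weighted cross-entropy for the word-level captioning task.
caption_loss_fn = nn.CrossEntropyLoss(weight=word_weights)

# --- Multi-task regularization (second method) ------------------------------
class CaptioningModel(nn.Module):
    """Shared encoder with two decoders: a word-level caption decoder
    (multi-class) and a clip-level content-word detector (multi-label)."""
    def __init__(self, n_mels=64, hidden=128, vocab_size=4, n_content_words=2):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.caption_decoder = nn.Linear(hidden, vocab_size)        # per-step word logits
        self.content_decoder = nn.Linear(hidden, n_content_words)   # clip-level logits

    def forward(self, audio_feats):
        enc_out, _ = self.encoder(audio_feats)            # (B, T, H)
        word_logits = self.caption_decoder(enc_out)       # (B, T, vocab)
        clip_emb = enc_out.mean(dim=1)                    # clip-level summary
        content_logits = self.content_decoder(clip_emb)   # (B, n_content_words)
        return word_logits, content_logits

model = CaptioningModel()
content_loss_fn = nn.BCEWithLogitsLoss()

# Toy batch: 2 clips, 10 frames of 64-dim features, frame-aligned word targets.
audio = torch.randn(2, 10, 64)
captions = torch.randint(0, 4, (2, 10))
content_targets = torch.tensor([[1., 0.], [1., 1.]])      # content words present per clip

word_logits, content_logits = model(audio)
caption_loss = caption_loss_fn(word_logits.reshape(-1, 4), captions.reshape(-1))
content_loss = content_loss_fn(content_logits, content_targets)

# The side-task loss regularizes the shared encoder.
lambda_reg = 0.5   # hypothetical trade-off weight, not taken from the paper
total_loss = caption_loss + lambda_reg * content_loss
total_loss.backward()
```

In this sketch both decoders backpropagate into the shared encoder, so the content-word detection loss acts as a regularizer on the representation used for captioning, which is the core idea of the multi-task setup.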
Related papers
- Zero-shot audio captioning with audio-language model guidance and audio
context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) to generate the text, guided by a pre-trained audio-language model, to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
- Towards Generating Diverse Audio Captions via Adversarial Training [33.76154801580643]
We propose a conditional generative adversarial network (C-GAN) to improve diversity of audio captioning systems.
A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model used to generate captions.
The results show that our proposed model can generate captions with better diversity as compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-12-05T05:06:19Z)
- Caption Feature Space Regularization for Audio Captioning [24.40864471466915]
General audio captioning models handle one-to-many training by randomly selecting a correlated caption as the ground truth for each audio clip.
We propose a two-stage framework for audio captioning: (i) in the first stage, via contrastive learning, we construct a proxy feature space to reduce the distances between captions correlated to the same audio, and (ii) in the second stage, the proxy feature space is used as additional supervision to encourage the model to be optimized in a direction that benefits all the correlated captions.
arXiv Detail & Related papers (2022-04-18T17:07:31Z)
- Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings.
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)