Audio Captioning using Pre-Trained Large-Scale Language Model Guided by
Audio-based Similar Caption Retrieval
- URL: http://arxiv.org/abs/2012.07331v1
- Date: Mon, 14 Dec 2020 08:27:36 GMT
- Title: Audio Captioning using Pre-Trained Large-Scale Language Model Guided by
Audio-based Similar Caption Retrieval
- Authors: Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi,
Masahiro Yasuda
- Abstract summary: The goal of audio captioning is to translate input audio into its description using natural language.
The proposed method succeeded in using a pre-trained language model for audio captioning.
The oracle performance of the pre-trained model-based caption generator was clearly better than that of the conventional method trained from scratch.
- Score: 28.57294189207084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of audio captioning is to translate input audio into its description
using natural language. One of the problems in audio captioning is the lack of
training data due to the difficulty in collecting audio-caption pairs by
crawling the web. In this study, to overcome this problem, we propose to use a
pre-trained large-scale language model. Since audio cannot be fed directly
into such a language model, we utilize guidance captions retrieved from the
training dataset based on the similarity between the input audio and the
training audio. Then, the caption of the input audio is generated by a pre-trained
language model while referring to the guidance captions. Experimental results
show that (i) the proposed method succeeded in using a pre-trained language
model for audio captioning, and (ii) the oracle performance of the pre-trained
model-based caption generator was clearly better than that of the conventional
method trained from scratch.
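The pipeline described in the abstract, retrieving guidance captions whose audio resembles the input and then generating with a pre-trained language model, can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' implementation: embed_audio is a toy placeholder for a pretrained audio encoder, cosine similarity stands in for whatever retrieval metric the paper uses, and plain GPT-2 prompting stands in for the paper's actual mechanism of conditioning the language model on the guidance captions.

# Minimal sketch of retrieval-guided audio captioning (illustrative only).
# Assumptions not taken from the paper: embed_audio() is a placeholder for a
# pretrained audio encoder, and GPT-2 with prompt concatenation stands in for
# the paper's pre-trained language model and its conditioning mechanism.
import numpy as np
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Toy stand-in for a pretrained audio encoder: maps a waveform to a
    fixed-size embedding (deterministic per input via a hash-derived seed)."""
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % (2 ** 32))
    return rng.standard_normal(128)

def retrieve_guidance(query_emb, train_embs, train_captions, k=3):
    """Return the captions of the k training clips whose embeddings have the
    highest cosine similarity to the query embedding."""
    sims = train_embs @ query_emb / (
        np.linalg.norm(train_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return [train_captions[i] for i in np.argsort(-sims)[:k]]

def generate_caption(guidance, tokenizer, model):
    """Condition the pre-trained LM on the guidance captions via a prompt."""
    prompt = ("Similar sounds were described as: " + "; ".join(guidance)
              + ". This sound is")
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# Toy training set of (audio, caption) pairs.
train_audio = [np.random.randn(16000) for _ in range(5)]
train_captions = ["a dog barks twice", "rain falls on a roof",
                  "a car engine idles", "birds chirp in the morning",
                  "people talk in a crowded room"]
train_embs = np.stack([embed_audio(a) for a in train_audio])

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

query = np.random.randn(16000)  # the input audio to be captioned
guidance = retrieve_guidance(embed_audio(query), train_embs, train_captions)
print(generate_caption(guidance, tokenizer, model))

The toy random waveforms and hash-seeded embeddings exist only to keep the sketch self-contained and runnable; in practice the embeddings would come from an audio tagging or sound-event model, so that acoustically similar clips retrieve semantically related captions.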
Related papers
- Learning Audio Concepts from Counterfactual Natural Language [34.118579918018725]
This study introduces causal reasoning and counterfactual analysis in the audio domain.
Our model considers acoustic characteristics and sound source information from human-annotated reference texts.
Specifically, the top-1 accuracy in the open-ended language-based audio retrieval task increased by more than 43%.
arXiv Detail & Related papers (2024-01-10T05:15:09Z)
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) to generate the text, guided by a pre-trained audio-language model, to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- An investigation on selecting audio pre-trained models for audio captioning [5.837881923712393]
Pre-trained models are widely used in audio captioning due to the high complexity of the task.
Unless a comprehensive system is re-trained, it is hard to determine how well a pre-trained model contributes to the audio captioning system.
In this paper, a series of pre-trained models are investigated for the correlation between extracted audio features and the performance of audio captioning.
arXiv Detail & Related papers (2022-08-12T06:14:20Z)
- Leveraging Pre-trained BERT for Audio Captioning [45.16535378268039]
BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks.
We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model.
Our models achieve competitive results with the existing audio captioning methods on the AudioCaps dataset.
arXiv Detail & Related papers (2022-03-06T00:05:58Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.