Leveraging Pre-trained BERT for Audio Captioning
- URL: http://arxiv.org/abs/2203.02838v1
- Date: Sun, 6 Mar 2022 00:05:58 GMT
- Title: Leveraging Pre-trained BERT for Audio Captioning
- Authors: Xubo Liu, Xinhao Mei, Qiushi Huang, Jianyuan Sun, Jinzheng Zhao, Haohe
Liu, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
- Abstract summary: BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks.
We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model.
Our models achieve results competitive with existing audio captioning methods on the AudioCaps dataset.
- Score: 45.16535378268039
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio captioning aims at using natural language to describe the content of an
audio clip. Existing audio captioning systems are generally based on an
encoder-decoder architecture, in which acoustic information is extracted by an
audio encoder and then a language decoder is used to generate the captions.
Training an audio captioning system often encounters the problem of data
scarcity. Transferring knowledge from pre-trained audio models such as
Pre-trained Audio Neural Networks (PANNs) has recently emerged as a useful
method to mitigate this issue. However, less attention has been paid to
exploiting pre-trained language models for the decoder than for the encoder. BERT is
a pre-trained language model that has been extensively used in Natural Language
Processing (NLP) tasks. Nevertheless, the potential of BERT as the language
decoder for audio captioning has not been investigated. In this study, we
demonstrate the efficacy of the pre-trained BERT model for audio captioning.
Specifically, we apply PANNs as the encoder and initialize the decoder from the
public pre-trained BERT models. We conduct an empirical study on the use of
these BERT models for the decoder in the audio captioning model. Our models
achieve results competitive with existing audio captioning methods on the
AudioCaps dataset.
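The described setup can be sketched with the Hugging Face transformers library: BERT is loaded as a causal decoder with cross-attention added, and PANNs features stand in as the encoder output. This is a minimal illustration, assuming frame-level 2048-dimensional PANNs embeddings and a linear projection to BERT's hidden size; the paper's exact wiring and training recipe may differ.

```python
# Minimal sketch: BERT initialized as the caption decoder, cross-attending
# to PANNs audio features. Pre-trained BERT has no cross-attention, so those
# weights are necessarily newly initialized.
import torch
from transformers import BertLMHeadModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
decoder = BertLMHeadModel.from_pretrained(
    "bert-base-uncased", is_decoder=True, add_cross_attention=True
)

# Hypothetical PANNs output: frame-level 2048-d embeddings, projected to
# BERT's hidden size so the cross-attention shapes match.
audio_feats = torch.randn(1, 32, 2048)            # (batch, frames, panns_dim)
proj = torch.nn.Linear(2048, decoder.config.hidden_size)
encoder_hidden = proj(audio_feats)                # (batch, frames, 768)

inputs = tokenizer("a dog barks as a car passes by", return_tensors="pt")
out = decoder(
    input_ids=inputs.input_ids,
    encoder_hidden_states=encoder_hidden,
    labels=inputs.input_ids,                      # teacher-forced training loss
)
print(out.loss)
```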
Related papers
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) to generate captions, guided by a pre-trained audio-language model.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
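The audio-language guidance can be approximated by rescoring candidate captions from the LLM with a pre-trained audio-text model. A simplified sketch using CLAP via Hugging Face transformers; ZerAuCap's actual method guides decoding token by token and uses audio context keywords, which is more involved than this whole-candidate rescoring.

```python
# Sketch: pick the LLM-generated candidate caption that best matches the
# audio under a pre-trained audio-language model (CLAP).
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

candidates = [  # hypothetical captions proposed by the LLM
    "a dog barking in the distance",
    "rain falling on a tin roof",
    "a crowd cheering at a stadium",
]
audio = np.random.randn(48000 * 10)  # placeholder 10 s waveform at 48 kHz

inputs = processor(text=candidates, audios=[audio],
                   sampling_rate=48000, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
# logits_per_audio holds audio-text similarities, shape (n_audio, n_text).
best = out.logits_per_audio.argmax(dim=-1).item()
print("best caption:", candidates[best])
```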
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
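The bridging idea can be sketched by encoding a video frame with pre-trained CLIP and treating the resulting embedding as the condition for an audio diffusion model. A minimal sketch with the diffusion model itself stubbed out; the model name and shapes are illustrative assumptions.

```python
# Sketch: obtain a CLIP image embedding to condition audio generation,
# in the spirit of CLIPSonic. The diffusion model is omitted here.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.new("RGB", (224, 224))           # placeholder video frame
inputs = processor(images=frame, return_tensors="pt")
with torch.no_grad():
    cond = model.get_image_features(**inputs)  # (1, 512) conditioning vector
# `cond` would condition a diffusion model that denoises a mel spectrogram
# into the video's audio track.
print(cond.shape)
```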
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
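The ChatGPT-assisted stage can be illustrated with a simple rewrite request. A hypothetical sketch using the OpenAI Python SDK; the prompt and the rest of the three-stage pipeline are assumptions, not the paper's actual prompts.

```python
# Sketch: use a chat LLM to turn a noisy web-harvested description into a
# clean, caption-style sentence. The prompt is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

raw = "DOG BARK 03.wav - free download!! angry dog barking loop, 24bit"

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "Rewrite the audio file description as one short "
                    "caption describing only the sound events. Remove "
                    "file names, formats, and advertising."},
        {"role": "user", "content": raw},
    ],
)
print(resp.choices[0].message.content)  # e.g. a cleaned one-sentence caption
```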
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
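The pseudo-language construction can be approximated by clustering frame-level speech features into discrete units and collapsing repeats. A rough sketch with k-means over placeholder MFCC frames; Wav2Seq's actual unit induction differs in detail.

```python
# Sketch: induce a "pseudo language" by quantizing frame features into
# discrete units, then deduplicating consecutive units to form compact
# pseudo tokens that serve as self-supervised pseudo-ASR targets.
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

frames = np.random.randn(500, 39)            # placeholder MFCC frames (T, D)
kmeans = KMeans(n_clusters=25, n_init=10).fit(frames)
units = kmeans.predict(frames)               # one discrete unit per frame

# Collapse runs of identical units into single pseudo tokens.
pseudo_tokens = [int(u) for u, _ in groupby(units)]
print(pseudo_tokens[:20])
```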
- Local Information Assisted Attention-free Decoder for Audio Captioning [52.191658157204856]
We present an automated audio captioning (AAC) method with an attention-free decoder, where an encoder based on PANNs is employed for audio feature extraction.
The proposed method enables the effective use of both global and local information from audio signals.
arXiv Detail & Related papers (2022-01-10T08:55:52Z)
- Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning [16.977616651315234]
A captioning system has to identify various kinds of information in the input signal and express them in natural language.
We evaluate the performance of off-the-shelf models with a Transformer-based captioning approach.
arXiv Detail & Related papers (2021-10-14T14:42:38Z)
- Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings.
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z)
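Of the three feature types, log Mel energies are the easiest to reproduce. A short sketch with librosa, using illustrative frame and Mel settings rather than the paper's exact configuration.

```python
# Sketch: extract log Mel energy features of the kind fed to the BiGRU
# encoder. All parameter choices below are illustrative defaults.
import librosa

y, sr = librosa.load("clip.wav", sr=32000)   # hypothetical input clip
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=320, n_mels=64
)
log_mel = librosa.power_to_db(mel)           # (n_mels, frames), in dB
print(log_mel.shape)
```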
- Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval [28.57294189207084]
The goal of audio captioning is to translate input audio into its description using natural language.
The proposed method succeeds in using a pre-trained language model for audio captioning.
The oracle performance of the pre-trained model-based caption generator was clearly better than that of the conventional method trained from scratch.
arXiv Detail & Related papers (2020-12-14T08:27:36Z)
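The retrieval step can be approximated with nearest-neighbour search over audio embeddings, with the retrieved captions prepended to the language model's prompt. A hypothetical sketch using placeholder embeddings and captions; the paper's retrieval model and prompt format differ.

```python
# Sketch: retrieve captions of acoustically similar training clips and use
# them to prime a pre-trained language model for caption generation.
import numpy as np

train_embeds = np.random.randn(1000, 512)    # hypothetical audio embeddings
train_captions = [f"caption {i}" for i in range(1000)]
query = np.random.randn(512)                 # embedding of the input clip

# Cosine-similarity nearest-neighbour retrieval of the top-3 captions.
sims = train_embeds @ query / (
    np.linalg.norm(train_embeds, axis=1) * np.linalg.norm(query)
)
top = np.argsort(sims)[::-1][:3]
prompt = " ".join(train_captions[i] for i in top) + " The audio contains"
print(prompt)  # fed to a pre-trained LM (e.g., GPT-2) to generate the caption
```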
This list is automatically generated from the titles and abstracts of the papers on this site.