Local Information Assisted Attention-free Decoder for Audio Captioning
- URL: http://arxiv.org/abs/2201.03217v1
- Date: Mon, 10 Jan 2022 08:55:52 GMT
- Title: Local Information Assisted Attention-free Decoder for Audio Captioning
- Authors: Feiyang Xiao, Jian Guan, Qiaoxi Zhu, Haiyan Lan, Wenwu Wang
- Abstract summary: We present an AAC method with an attention-free decoder, where an encoder based on PANNs is employed for audio feature extraction.
The proposed method enables the effective use of both global and local information from audio signals.
- Score: 52.191658157204856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated audio captioning (AAC) aims to describe audio data with captions
using natural language. Most existing AAC methods adopt an encoder-decoder
structure, where an attention-based mechanism is a popular choice in the
decoder (e.g., the Transformer decoder) for predicting captions from audio
features. Such attention-based decoders can capture global information from
the audio features; however, their ability to extract local information can
be limited, which may lead to degraded quality in the generated captions. In
this paper, we present an AAC method with an attention-free decoder, where an
encoder based on PANNs is employed for audio feature extraction, and the
attention-free decoder is designed to introduce local information. The proposed
method enables the effective use of both global and local information from
audio signals. Experiments show that our method outperforms the
state-of-the-art methods with the standard attention-based decoder in Task 6 of
the DCASE 2021 Challenge.
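The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of the general idea rather than the authors' exact design: frame-level features from a PANNs-like encoder are mean-pooled into a global audio embedding, while a causal depthwise 1-D convolution over the word embeddings supplies local context in place of attention. All dimensions and module names are assumptions.
```python
# Illustrative only: a global audio vector (mean-pooled PANNs-style features)
# fused with local token context from a causal depthwise convolution,
# standing in for the attention mechanism in the decoder.
import torch
import torch.nn as nn


class AttentionFreeCaptionDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, kernel_size: int = 3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        # Depthwise 1-D convolution over the token sequence captures local context.
        self.local_conv = nn.Conv1d(d_model, d_model, kernel_size,
                                    padding=kernel_size - 1, groups=d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, audio_feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, frames, d_model) from a PANNs-like encoder
        # tokens:      (B, seq_len) ids of the previously generated words
        seq_len = tokens.size(1)
        global_audio = audio_feats.mean(dim=1, keepdim=True)            # (B, 1, D)
        words = self.word_emb(tokens).transpose(1, 2)                   # (B, D, T)
        # Trim the right side so the convolution stays causal.
        local = self.local_conv(words)[:, :, :seq_len].transpose(1, 2)  # (B, T, D)
        fused = self.fuse(torch.cat([local, global_audio.expand_as(local)], dim=-1))
        return self.out(torch.tanh(fused))                              # (B, T, vocab)


decoder = AttentionFreeCaptionDecoder(vocab_size=5000)
logits = decoder(torch.randn(2, 32, 256), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 5000])
```
The causal padding ensures each decoding step only attends to the current and earlier tokens; the actual paper may introduce local information differently, so this is purely illustrative.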
Related papers
- Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words [10.2138250640885]
We develop a large language model (LLM) based automatic speech recognition (ASR) system that can be contextualized by providing keywords in text prompts.
We adopt a decoder-only architecture and use our in-house LLM, PLaMo-100B, pre-trained from scratch on datasets dominated by Japanese and English text, as the decoder.
arXiv Detail & Related papers (2024-08-15T08:50:58Z)
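As a rough illustration of the keyword-contextualization idea summarized above, the sketch below builds a text prompt that lists bias keywords for a decoder-only LLM ASR system. The prompt template and the llm_decode callable are assumptions for illustration; they are not the paper's actual interface to PLaMo-100B.
```python
# Hypothetical sketch: rare or ambiguous words are listed in the text prompt
# so the LLM decoder can copy their spellings into the transcript.
from typing import Callable, Sequence


def build_asr_prompt(keywords: Sequence[str]) -> str:
    """Prepend bias keywords ahead of the transcription request."""
    keyword_block = ", ".join(keywords) if keywords else "(none)"
    return (
        "Transcribe the speech into text.\n"
        f"Relevant keywords: {keyword_block}\n"
        "Transcript:"
    )


def transcribe(audio_tokens: Sequence[int],
               keywords: Sequence[str],
               llm_decode: Callable[[str, Sequence[int]], str]) -> str:
    # llm_decode stands in for a decoder-only LLM that conditions on the text
    # prompt plus projected audio tokens.
    return llm_decode(build_asr_prompt(keywords), audio_tokens)


print(build_asr_prompt(["Kawasaki", "PLaMo", "Chiyoda-ku"]))
```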
- VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement [68.42632589736881]
We pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
To solve this task, we disentangle each target sound audio into three components: temporal information, acoustic information, and background information.
Our method can generate high-quality audio samples with good synchronization with events in video and high timbre similarity with the reference audio.
arXiv Detail & Related papers (2022-11-19T11:12:01Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
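A hedged sketch of the adaptive audio-visual fusion idea in the entry above: a learned gate decides, per decoding step, how much to rely on visual versus audio context when the sound alone is ambiguous. The module structure and dimensions are illustrative assumptions, not the paper's exact attention design.
```python
# Illustrative adaptive fusion of audio and visual context via a learned gate.
import torch
import torch.nn as nn


class AdaptiveAVFusion(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, query, audio_feats, visual_feats):
        # query: (B, T, D) decoder states; audio/visual feats: (B, Na/Nv, D)
        a_ctx, _ = self.audio_attn(query, audio_feats, audio_feats)
        v_ctx, _ = self.visual_attn(query, visual_feats, visual_feats)
        g = torch.sigmoid(self.gate(torch.cat([a_ctx, v_ctx], dim=-1)))  # (B, T, 1)
        return g * a_ctx + (1.0 - g) * v_ctx


fusion = AdaptiveAVFusion()
out = fusion(torch.randn(2, 5, 256), torch.randn(2, 30, 256), torch.randn(2, 10, 256))
print(out.shape)  # torch.Size([2, 5, 256])
```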
- Leveraging Pre-trained BERT for Audio Captioning [45.16535378268039]
BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks.
We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model.
Our models achieve competitive results with the existing audio captioning methods on the AudioCaps dataset.
arXiv Detail & Related papers (2022-03-06T00:05:58Z)
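One common way to reuse a BERT-style model as a caption decoder is to enable causal decoding and cross-attention over the audio encoder output, as sketched below with the Hugging Face transformers API. Whether this matches the paper's exact setup is an assumption; a tiny randomly initialized config keeps the snippet self-contained, whereas in practice pretrained BERT weights would be loaded with from_pretrained.
```python
# Sketch: a BERT language-model head used as the caption decoder, cross-attending
# to audio encoder states passed via encoder_hidden_states.
import torch
from transformers import BertConfig, BertLMHeadModel

config = BertConfig(
    vocab_size=1000, hidden_size=128, num_hidden_layers=2,
    num_attention_heads=4, intermediate_size=256,
    is_decoder=True, add_cross_attention=True,
)
decoder = BertLMHeadModel(config)  # random init here; the paper would start from pretrained BERT

audio_states = torch.randn(2, 32, config.hidden_size)   # stand-in for audio encoder output
tokens = torch.randint(0, config.vocab_size, (2, 12))    # caption token ids

out = decoder(input_ids=tokens, encoder_hidden_states=audio_states, labels=tokens)
print(out.loss, out.logits.shape)  # scalar LM loss, torch.Size([2, 12, 1000])
```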
- CL4AC: A Contrastive Loss for Audio Captioning [43.83939284740561]
We propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC).
In CL4AC, the self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and texts.
Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2021-07-21T10:13:02Z)
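The entry above only states that self-supervision signals from audio-text pairs are exploited. As a generic stand-in (not the exact CL4AC objective), the sketch below shows an InfoNCE-style contrastive loss that pulls matched audio and caption embeddings together and pushes mismatched in-batch pairs apart.
```python
# Generic audio-text contrastive loss: matched pairs share the same row index.
import torch
import torch.nn.functional as F


def audio_text_contrastive_loss(audio_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # audio_emb, text_emb: (batch, dim); row i of each side is a matched pair
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric cross-entropy over both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


loss = audio_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```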
- Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANNs) embeddings.
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z)
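A minimal sketch of the feature-composition idea above: several frame-level audio feature streams are concatenated, encoded with a BiGRU, and combined with a semantic tag embedding. The feature dimensions, the frame alignment of the streams, and the additive fusion are all assumptions for illustration.
```python
# Illustrative BiGRU encoder over concatenated log-Mel, VGGish and PANNs-style
# features, fused with an embedding of semantic tags.
import torch
import torch.nn as nn


class BiGRUAudioEncoder(nn.Module):
    def __init__(self, mel_dim=64, vggish_dim=128, panns_dim=2048,
                 semantic_vocab=300, hidden=256):
        super().__init__()
        self.bigru = nn.GRU(mel_dim + vggish_dim + panns_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.semantic_emb = nn.Embedding(semantic_vocab, 2 * hidden)

    def forward(self, mel, vggish, panns, tag_ids):
        # mel/vggish/panns: (B, frames, dim), assumed frame-aligned; tag_ids: (B, n_tags)
        feats = torch.cat([mel, vggish, panns], dim=-1)
        seq, _ = self.bigru(feats)                           # (B, frames, 2*hidden)
        semantic = self.semantic_emb(tag_ids).mean(1, keepdim=True)
        return seq + semantic                                # fuse audio and semantic info


enc = BiGRUAudioEncoder()
out = enc(torch.randn(2, 50, 64), torch.randn(2, 50, 128),
          torch.randn(2, 50, 2048), torch.randint(0, 300, (2, 3)))
print(out.shape)  # torch.Size([2, 50, 512])
```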
- WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information [20.153258692295278]
We present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio.
We employ three learnable processes for audio encoding, two for extracting the local and temporal information, and one to merge the output of the previous two processes.
Our results increase the previously reported highest SPIDEr from 16.2 to 17.3.
arXiv Detail & Related papers (2020-10-21T16:02:25Z)
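A rough sketch of the three-process encoder described above: one branch for temporal patterns, one for local time-frequency patterns, and a third module that merges them. The specific layers and sizes below are illustrative assumptions, not WaveTransformer's actual configuration.
```python
# Illustrative three-process audio encoder: temporal branch, time-frequency
# branch, and a merging layer.
import torch
import torch.nn as nn


class ThreeProcessEncoder(nn.Module):
    def __init__(self, n_mels=64, d_model=128):
        super().__init__()
        # Process 1: temporal information via dilated 1-D convolutions.
        self.temporal = nn.Sequential(
            nn.Conv1d(n_mels, d_model, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, 3, padding=2, dilation=2), nn.ReLU(),
        )
        # Process 2: local time-frequency information via 2-D convolutions.
        self.local = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.ReLU(),
        )
        self.local_proj = nn.Linear(n_mels, d_model)
        # Process 3: merge the outputs of the two branches.
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, mel):                                        # mel: (B, frames, n_mels)
        t = self.temporal(mel.transpose(1, 2)).transpose(1, 2)     # (B, frames, d_model)
        l = self.local(mel.unsqueeze(1)).squeeze(1)                # (B, frames, n_mels)
        l = self.local_proj(l)                                     # (B, frames, d_model)
        return self.merge(torch.cat([t, l], dim=-1))               # (B, frames, d_model)


enc = ThreeProcessEncoder()
print(enc(torch.randn(2, 100, 64)).shape)  # torch.Size([2, 100, 128])
```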
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)