WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information
- URL: http://arxiv.org/abs/2010.11098v1
- Date: Wed, 21 Oct 2020 16:02:25 GMT
- Title: WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information
- Authors: An Tran, Konstantinos Drossos, and Tuomas Virtanen
- Abstract summary: We present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio.
We employ three learnable processes for audio encoding, two for extracting the local and temporal information, and one to merge the outputs of the previous two processes.
Our results increase the previously reported highest SPIDEr from 16.2 to 17.3.
- Score: 20.153258692295278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated audio captioning (AAC) is a novel task, where a method takes as
input an audio sample and outputs a textual description (i.e. a caption) of its
contents. Most AAC methods are adapted from the image captioning or machine
translation fields. In this work we present a novel AAC method, explicitly
focused on the exploitation of the temporal and time-frequency patterns in
audio. We employ three learnable processes for audio encoding, two for
extracting the local and temporal information, and one to merge the outputs of
the previous two processes. To generate the caption, we employ the widely used
Transformer decoder. We assess our method using the freely available splits of
the Clotho dataset. Our results increase the previously reported highest
SPIDEr from 16.2 to 17.3.
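The three-process encoder described in the abstract can be sketched in a few
lines of PyTorch. This is a hedged illustration only: the branch layer types
(dilated 1D convolutions for temporal patterns, 2D convolutions for
time-frequency patterns), the sizes, and the module names are assumptions, not
the authors' exact configuration.

```python
# Minimal sketch of a three-process audio encoder: one branch for temporal
# patterns, one for local time-frequency patterns, one merge step. All
# layer choices and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class WaveTransformerEncoder(nn.Module):
    def __init__(self, n_mels: int = 64, d_model: int = 128):
        super().__init__()
        # Process 1 (assumed): dilated 1D convolutions over time,
        # extracting temporal patterns.
        self.temporal = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )
        # Process 2 (assumed): 2D convolutions over the spectrogram,
        # extracting local time-frequency patterns.
        self.time_freq = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.tf_proj = nn.Linear(n_mels, d_model)
        # Process 3 (assumed): merge the outputs of the two branches.
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) log mel spectrogram
        h_temp = self.temporal(mel.transpose(1, 2)).transpose(1, 2)
        h_tf = self.tf_proj(self.time_freq(mel.unsqueeze(1)).squeeze(1))
        # Merged sequence (batch, time, d_model) for the caption decoder.
        return self.merge(torch.cat([h_temp, h_tf], dim=-1))
```

Per the abstract, the merged sequence would then serve as the memory input to
a standard torch.nn.TransformerDecoder, which generates caption tokens
autoregressively.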
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
However, they often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award in Task 6A of the DCASE 2022 Challenge.
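As background on the Patchout idea referenced above: a random subset of
spectrogram patch embeddings is dropped during training, which regularizes the
model and shortens the sequence the Transformer attends over. A minimal
sketch, where the unstructured per-patch dropping and the drop_prob parameter
are assumptions rather than the cited paper's exact scheme:

```python
# Hedged sketch of Patchout: randomly keep a subset of patch embeddings
# during training, shortening the Transformer input sequence.
import torch


def patchout(patches: torch.Tensor, drop_prob: float = 0.3) -> torch.Tensor:
    """patches: (batch, n_patches, d_model) sequence of patch embeddings."""
    if drop_prob <= 0.0:
        return patches
    batch, n_patches, d_model = patches.shape
    n_keep = max(1, int(n_patches * (1.0 - drop_prob)))
    # Random subset of patch positions per example (same count each).
    idx = torch.rand(batch, n_patches, device=patches.device)
    idx = idx.argsort(dim=1)[:, :n_keep]
    idx = idx.unsqueeze(-1).expand(-1, -1, d_model)
    return patches.gather(1, idx)
```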
arXiv Detail & Related papers (2023-04-06T07:58:27Z)
- VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement [68.42632589736881]
We pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
To solve this task, we disentangle each target sound audio into three components: temporal information, acoustic information, and background information.
Our method can generate high-quality audio samples with good synchronization with events in video and high timbre similarity with the reference audio.
arXiv Detail & Related papers (2022-11-19T11:12:01Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Local Information Assisted Attention-free Decoder for Audio Captioning [52.191658157204856]
We present an AAC method with an attention-free decoder, where an encoder based on PANNs is employed for audio feature extraction.
The proposed method enables the effective use of both global and local information from audio signals.
arXiv Detail & Related papers (2022-01-10T08:55:52Z)
- Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning [16.977616651315234]
A captioning system has to identify various information from the input signal and express it with natural language.
We evaluate the performance of off-the-shelf models with a Transformer-based captioning approach.
arXiv Detail & Related papers (2021-10-14T14:42:38Z)
- CL4AC: A Contrastive Loss for Audio Captioning [43.83939284740561]
We propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC).
In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text.
Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.
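The contrastive idea can be illustrated with a generic InfoNCE-style auxiliary
loss over paired audio and text embeddings; CL4AC's actual formulation, which
builds its self-supervision signals from the paired data, may differ, and the
temperature value here is an assumption:

```python
# Generic sketch of a contrastive loss over paired audio/text embeddings.
# Matched pairs (diagonal) are positives; all other pairings are negatives.
import torch
import torch.nn.functional as F


def contrastive_loss(audio_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, text_emb: (batch, dim); row i of each is a true pair."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: audio-to-text and text-to-audio.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```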
arXiv Detail & Related papers (2021-07-21T10:13:02Z)
- Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning [21.603519845525483]
We present an approach that focuses on explicitly taking advantage of the difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence.
We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder.
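A minimal sketch of temporal sub-sampling between stacked encoder RNNs, in the
spirit of the description above; the GRU layer types, hidden sizes, and fixed
sub-sampling factor of 2 are illustrative assumptions:

```python
# Sketch: keep every `factor`-th timestep between stacked encoder GRUs,
# shortening the sequence, and emit a fixed-length vector for the decoder.
import torch
import torch.nn as nn


class SubsampledRNNEncoder(nn.Module):
    def __init__(self, n_mels: int = 64, hidden: int = 128, factor: int = 2):
        super().__init__()
        self.factor = factor
        self.rnn1 = nn.GRU(n_mels, hidden, batch_first=True)
        self.rnn2 = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_mels)
        h, _ = self.rnn1(x)
        h = h[:, ::self.factor, :]  # temporal sub-sampling between RNNs
        h, _ = self.rnn2(h)
        return h[:, -1, :]  # fixed-length encoder output vector
```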
arXiv Detail & Related papers (2020-07-06T12:19:23Z)
- Audio Captioning using Gated Recurrent Units [1.3960152426268766]
The VGGish audio embedding model is used to explore the usability of audio embeddings in the audio captioning task.
The proposed architecture encodes the audio and text input modalities separately and combines them before the decoding stage.
Our experimental results show that the proposed BiGRU-based deep model outperforms the state of the art.
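A hedged sketch of the described two-branch design: audio embeddings (e.g.
VGGish frames) and the word sequence are encoded by separate bidirectional
GRUs and combined before the output stage. The dimensions, the
merge-by-concatenation, and all module names are assumptions:

```python
# Sketch: encode audio and text separately with BiGRUs, then concatenate
# the audio summary onto each text step before predicting the next word.
import torch
import torch.nn as nn


class BiGRUCaptioner(nn.Module):
    def __init__(self, audio_dim: int = 128, vocab: int = 5000,
                 emb: int = 128, hidden: int = 128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, emb)
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.text_rnn = nn.GRU(emb, hidden, batch_first=True,
                               bidirectional=True)
        self.out = nn.Linear(4 * hidden, vocab)

    def forward(self, audio: torch.Tensor, tokens: torch.Tensor):
        # audio: (batch, frames, audio_dim); tokens: (batch, seq)
        _, ha = self.audio_rnn(audio)          # final states: (2, batch, hidden)
        ht, _ = self.text_rnn(self.word_emb(tokens))
        a = torch.cat([ha[0], ha[1]], dim=-1)  # (batch, 2*hidden) audio summary
        a = a.unsqueeze(1).expand(-1, ht.size(1), -1)
        # Combine the modalities before the decoding (output) stage.
        return self.out(torch.cat([ht, a], dim=-1))  # (batch, seq, vocab)
```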
arXiv Detail & Related papers (2020-06-05T12:03:12Z)