Temporal Sub-sampling of Audio Feature Sequences for Automated Audio
Captioning
- URL: http://arxiv.org/abs/2007.02676v1
- Date: Mon, 6 Jul 2020 12:19:23 GMT
- Title: Temporal Sub-sampling of Audio Feature Sequences for Automated Audio
Captioning
- Authors: Khoa Nguyen and Konstantinos Drossos and Tuomas Virtanen
- Abstract summary: We present an approach that focuses on explicitly taking advantage of the difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence.
We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder.
- Score: 21.603519845525483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio captioning is the task of automatically creating a textual description
for the contents of a general audio signal. Typical audio captioning methods
rely on deep neural networks (DNNs), where the target of the DNN is to map the
input audio sequence to an output sequence of words, i.e. the caption. However,
the length of the textual description is considerably shorter than the length of
the audio signal, for example 10 words versus several thousand audio feature
vectors. This clearly indicates that an output word corresponds to multiple
input feature vectors. In this work we present an approach that focuses on
explicitly taking advantage of this difference of lengths between sequences, by
applying a temporal sub-sampling to the audio input sequence. We employ a
sequence-to-sequence method, which uses a fixed-length vector as an output from
the encoder, and we apply temporal sub-sampling between the RNNs of the
encoder. We evaluate the benefit of our approach by employing the freely
available dataset Clotho and we evaluate the impact of different factors of
temporal sub-sampling. Our results show an improvement to all considered
metrics.
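To make the method concrete, the sketch below shows one way of placing temporal sub-sampling between the recurrent layers of an encoder and taking the last hidden state as the fixed-length output. It is a minimal illustration under assumed choices (PyTorch, GRU layers, a sub-sampling factor of 2, illustrative dimensions); it is not the authors' implementation, and their exact configuration may differ.

```python
# Minimal sketch (not the authors' implementation): an RNN encoder that
# sub-samples the feature sequence in time between its recurrent layers,
# so each later layer processes a shorter sequence, and the final hidden
# state serves as the fixed-length encoder output.
import torch
import torch.nn as nn


class SubsamplingEncoder(nn.Module):
    def __init__(self, n_feats=64, hidden=256, factor=2):
        super().__init__()
        self.factor = factor                            # keep every `factor`-th frame
        self.rnn1 = nn.GRU(n_feats, hidden, batch_first=True)
        self.rnn2 = nn.GRU(hidden, hidden, batch_first=True)
        self.rnn3 = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, x):                               # x: (batch, time, n_feats)
        h, _ = self.rnn1(x)
        h = h[:, ::self.factor, :]                      # sub-sample between RNNs 1 and 2
        h, _ = self.rnn2(h)
        h = h[:, ::self.factor, :]                      # sub-sample between RNNs 2 and 3
        _, last = self.rnn3(h)
        return last.squeeze(0)                          # fixed-length vector: (batch, hidden)


# Example: a sequence of 2000 feature vectors is reduced to 500 before the
# last recurrent layer; a decoder would consume the returned vector.
encoder = SubsamplingEncoder()
fixed = encoder(torch.randn(4, 2000, 64))
print(fixed.shape)                                      # torch.Size([4, 256])
```

Keeping every second frame between layers halves the sequence length at each stage, so the last recurrent layer summarises roughly a quarter of the original number of frames into a single vector for the decoder.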
Related papers
- TokenSplit: Using Discrete Speech Representations for Direct, Refined,
and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Play It Back: Iterative Attention for Audio Recognition [104.628661890361]
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time.
We propose an end-to-end attention-based architecture that, through selective repetition, attends over the most discriminative sounds.
We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks.
arXiv Detail & Related papers (2022-10-20T15:03:22Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings (see the log-Mel extraction sketch after this list).
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z)
- WaveTransformer: A Novel Architecture for Audio Captioning Based on
Learning Temporal and Time-Frequency Information [20.153258692295278]
We present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio.
We employ three learnable processes for audio encoding, two for extracting the local and temporal information, and one to merge the output of the previous two processes.
Our results increase the previously reported highest SPIDEr score from 16.2 to 17.3.
arXiv Detail & Related papers (2020-10-21T16:02:25Z)
- Incremental Text to Speech for Neural Sequence-to-Sequence Models using
Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent that decides, as the input arrives, whether to wait for more text or to begin synthesising audio.
arXiv Detail & Related papers (2020-08-07T11:48:05Z)
- Listen carefully and tell: an audio captioning system based on residual
learning and gammatone audio representation [4.591851728010269]
An automated audio captioning system accepts an audio signal as input and outputs a textual description of its content.
In this work, an automatic audio captioning method based on residual learning in the encoder phase is proposed.
Results show that the framework proposed in this work surpasses the baseline system in the challenge results.
arXiv Detail & Related papers (2020-06-27T17:16:49Z)
- Audio Captioning using Gated Recurrent Units [1.3960152426268766]
The VGGish audio embedding model is used to explore the usability of audio embeddings in the audio captioning task.
The proposed architecture encodes the audio and text input modalities separately and combines them before the decoding stage.
Our experimental results show that the proposed BiGRU-based deep model outperforms the state of the art.
arXiv Detail & Related papers (2020-06-05T12:03:12Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
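For the log Mel energy features referenced in the Audio Captioning with Composition of Acoustic and Semantic Information entry above, the following is a minimal extraction sketch. All parameters (librosa, 64 Mel bands, 1024-sample FFT, 512-sample hop, 44.1 kHz) and the file name are assumptions made for illustration, not settings reported in any of the listed papers.

```python
# Hypothetical helper (parameters and file name are illustrative, not taken
# from the papers above): compute log Mel energy features for an audio clip.
import librosa
import numpy as np


def log_mel_energies(path, sr=44100, n_mels=64):
    y, _ = librosa.load(path, sr=sr)                  # mono waveform at `sr` Hz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=n_mels, power=1.0)
    return np.log(mel + np.finfo(float).eps).T        # shape: (n_frames, n_mels)


# feats = log_mel_energies("clip.wav")                # "clip.wav" is a placeholder
# feats.shape -> (n_frames, 64)
```

At this hop size a 20-second clip yields roughly 1,700 feature frames, which illustrates the input/output length mismatch that the main paper's temporal sub-sampling is designed to exploit.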