Audio Captioning Transformer
- URL: http://arxiv.org/abs/2107.09817v1
- Date: Wed, 21 Jul 2021 00:31:50 GMT
- Title: Audio Captioning Transformer
- Authors: Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley and Wenwu Wang
- Abstract summary: Audio captioning aims to automatically generate a natural language description of an audio clip.
Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder.
We propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free.
- Score: 44.68751180694813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio captioning aims to automatically generate a natural language
description of an audio clip. Most captioning models follow an encoder-decoder
architecture, where the decoder predicts words based on the audio features
extracted by the encoder. Convolutional neural networks (CNNs) and recurrent
neural networks (RNNs) are often used as the audio encoder. However, CNNs can
be limited in modelling temporal relationships among the time frames in an
audio signal, while RNNs can be limited in modelling the long-range
dependencies among the time frames. In this paper, we propose an Audio
Captioning Transformer (ACT), which is a full Transformer network based on an
encoder-decoder architecture and is totally convolution-free. The proposed
method has a better ability to model the global information within an audio
signal as well as capture temporal relationships between audio events. We
evaluate our model on AudioCaps, which is the largest audio captioning dataset
publicly available. Our model shows competitive performance compared to other
state-of-the-art approaches.
Related papers
- SNAC: Multi-Scale Neural Audio Codec [1.0753191494611891]
Multi-Scale Neural Audio Codec is a simple extension of RVQ where the quantizers can operate at different temporal resolutions.
This paper proposes Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions.
arXiv Detail & Related papers (2024-10-18T12:24:05Z) - Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference [10.909997817643905]
We present the Low Frame-rate Speech Codec (LFSC): a neural audio that leverages a finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps and 21.5 frames per second.
We demonstrate that our novel LLM can make the inference of text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
arXiv Detail & Related papers (2024-09-18T16:39:10Z) - Large-scale unsupervised audio pre-training for video-to-speech
synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs)
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z) - Transformer Transducer: One Model Unifying Streaming and Non-streaming
Speech Recognition [16.082949461807335]
We present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model.
We show that we can run this model in a Y-model architecture with the top layers running in parallel in low latency and high latency modes.
This allows us to have streaming speech recognition results with limited latency and delayed speech recognition results with large improvements in accuracy.
arXiv Detail & Related papers (2020-10-07T05:58:28Z) - Transformer Transducer: A Streamable Speech Recognition Model with
Transformer Encoders and RNN-T Loss [14.755108017449295]
We present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system.
Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently.
We present results on the LibriSpeech dataset showing that limiting the left context for self-attention makes decoding computationally tractable for streaming.
arXiv Detail & Related papers (2020-02-07T00:04:04Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z) - RawNet: Fast End-to-End Neural Vocoder [4.507860128918788]
RawNet is a complete end-to-end neural vocoder based on the auto-encoder structure for speaker-dependent and -independent speech synthesis.
It automatically learns to extract features and recover audio using neural networks, which include a coder network to capture a higher representation of the input audio and an autoregressive voder network to restore the audio in a sample-by-sample manner.
arXiv Detail & Related papers (2019-04-10T10:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.