Automatic Audio Captioning using Attention weighted Event based
Embeddings
- URL: http://arxiv.org/abs/2201.12352v1
- Date: Fri, 28 Jan 2022 05:54:19 GMT
- Title: Automatic Audio Captioning using Attention weighted Event based
Embeddings
- Authors: Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu
- Abstract summary: We propose an encoder-decoder architecture with light-weight (i.e. with lesser learnable parameters) Bi-LSTM recurrent layers for AAC.
Our results show that an efficient AED based embedding extractor combined with temporal attention and augmentation techniques is able to surpass existing literature.
- Score: 25.258177951665594
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic Audio Captioning (AAC) refers to the task of translating audio into
a natural language that describes the audio events, source of the events and
their relationships. The limited samples in AAC datasets at present, has set up
a trend to incorporate transfer learning with Audio Event Detection (AED) as a
parent task. Towards this direction, in this paper, we propose an
encoder-decoder architecture with light-weight (i.e. with lesser learnable
parameters) Bi-LSTM recurrent layers for AAC and compare the performance of two
state-of-the-art pre-trained AED models as embedding extractors. Our results
show that an efficient AED based embedding extractor combined with temporal
attention and augmentation techniques is able to surpass existing literature
with computationally intensive architectures. Further, we provide evidence of
the ability of the non-uniform attention weighted encoding generated as a part
of our model to facilitate the decoder glance over specific sections of the
audio while generating each token.
Related papers
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis on building ASR systems with discrete codes.
We investigate different methods for training such as quantization schemes and time-domain vs spectral feature encodings.
We introduce a pipeline that outperforms Encodec at similar bit-rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z) - C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs.
We employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - Parameter Efficient Audio Captioning With Faithful Guidance Using
Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter efficient inference time faithful decoding algorithm that enables smaller audio captioning models with performance equivalent to larger models trained with more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z) - Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z) - Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges Award at the Task6A of DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z) - Interactive Audio-text Representation for Automated Audio Captioning
with Contrastive Learning [25.06635361326706]
We propose a novel AAC system called CLIP-AAC to learn interactive cross-modality representation.
The proposed CLIP-AAC introduces an audio-head and a text-head in the pre-trained encoder to extract audio-text information.
We also apply contrastive learning to narrow the domain difference by learning the correspondence between the audio signal and its paired captions.
arXiv Detail & Related papers (2022-03-29T13:06:46Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR)
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Audio Captioning using Gated Recurrent Units [1.3960152426268766]
VGGish audio embedding model is used to explore the usability of audio embeddings in the audio captioning task.
The proposed architecture encodes audio and text input modalities separately and combines them before the decoding stage.
Our experimental results show that the proposed BiGRU-based deep model outperforms the state of the art results.
arXiv Detail & Related papers (2020-06-05T12:03:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.