Efficient Audio Captioning Transformer with Patchout and Text Guidance
- URL: http://arxiv.org/abs/2304.02916v1
- Date: Thu, 6 Apr 2023 07:58:27 GMT
- Title: Efficient Audio Captioning Transformer with Patchout and Text Guidance
- Authors: Thodoris Kouzelis, Grigoris Bastas, Athanasios Katsamanis and
Alexandros Potamianos
- Abstract summary: We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award at Task 6A of the DCASE Challenge 2022.
- Score: 74.59739661383726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated audio captioning is a multi-modal translation task that aims to
generate textual descriptions for a given audio clip. In this paper, we propose
a full Transformer architecture that utilizes Patchout as proposed in [1],
significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted
by a pre-trained classification model which is fine-tuned to maximize the
semantic similarity between AudioSet labels and ground truth captions. To
mitigate the data scarcity problem of Automated Audio Captioning we introduce
transfer learning from an upstream audio-related task and an enlarged in-domain
dataset. Moreover, we propose a method to apply Mixup augmentation for AAC.
Ablation studies are carried out to investigate how Patchout and text guidance
contribute to the final performance. The results show that the proposed
techniques improve the performance of our system while reducing the
computational complexity. Our proposed method received the Judges' Award at
Task 6A of the DCASE Challenge 2022.
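For intuition, below is a minimal sketch of structured Patchout in the spirit of [1]: the spectrogram is split into patch embeddings, and whole frequency rows and time columns of patches are randomly dropped during training, which shortens the token sequence seen by self-attention (lower compute) and acts as a regularizer (less overfitting). The function name, tensor layout, and drop counts are illustrative assumptions, not the authors' implementation.

```python
import torch

def patchout_structured(patches, n_freq_drop=2, n_time_drop=10):
    """Minimal sketch of structured Patchout (names and defaults are illustrative).

    `patches` has shape (batch, n_freq_patches, n_time_patches, embed_dim),
    i.e. the output of the patch-embedding layer of a spectrogram Transformer
    before flattening. Whole frequency rows and time columns of patches are
    dropped at random, shortening the sequence fed to self-attention.
    """
    b, f, t, d = patches.shape
    # Randomly select the frequency rows and time columns to keep.
    keep_f = torch.randperm(f)[: f - n_freq_drop].sort().values
    keep_t = torch.randperm(t)[: t - n_time_drop].sort().values
    patches = patches[:, keep_f][:, :, keep_t]
    # Flatten the remaining patches into the (shorter) token sequence
    # consumed by the Transformer encoder.
    return patches.reshape(b, -1, d)


# Example: a hypothetical 8 x 62 grid of 768-dim patch embeddings.
x = torch.randn(4, 8, 62, 768)
tokens = patchout_structured(x)   # (4, 6 * 52, 768) with the defaults above
```

Patchout of this kind is only applied during training; at inference time the full patch sequence is kept.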
Related papers
- Improving Text-To-Audio Models with Synthetic Captions [51.19111942748637]
We propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale.
We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions.
arXiv Detail & Related papers (2024-06-18T00:02:15Z) - VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z) - Parameter Efficient Audio Captioning With Faithful Guidance Using
Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter efficient inference time faithful decoding algorithm that enables smaller audio captioning models with performance equivalent to larger models trained with more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Improving Natural-Language-based Audio Retrieval with Transfer Learning
and Audio & Text Augmentations [7.817685358710508]
We propose a system to project recordings and textual descriptions into a shared audio-caption space.
Our results show that the used augmentations strategies reduce overfitting and improve retrieval performance.
We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.
arXiv Detail & Related papers (2022-08-24T11:54:42Z) - Interactive Audio-text Representation for Automated Audio Captioning
with Contrastive Learning [25.06635361326706]
We propose a novel AAC system called CLIP-AAC to learn interactive cross-modality representation.
The proposed CLIP-AAC introduces an audio-head and a text-head in the pre-trained encoder to extract audio-text information.
We also apply contrastive learning to narrow the domain difference by learning the correspondence between the audio signal and its paired captions.
arXiv Detail & Related papers (2022-03-29T13:06:46Z) - Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z) - Speaker Embedding-aware Neural Diarization for Flexible Number of
Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z) - CL4AC: A Contrastive Loss for Audio Captioning [43.83939284740561]
We propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC).
In CL4AC, the self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and texts.
Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2021-07-21T10:13:02Z)
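Both CLIP-AAC and CL4AC above rely on learning the correspondence between an audio clip and its paired caption. As a rough illustration of that idea only (not the exact loss of either paper), the following is a generic symmetric InfoNCE-style audio-text contrastive objective; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired audio/text embeddings.

    `audio_emb` and `text_emb` have shape (batch, dim); row i of each tensor
    comes from the same audio-caption pair. Matched pairs are pulled together,
    mismatched pairs within the batch are pushed apart. The temperature is an
    illustrative default, not a value taken from either paper.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Cross-entropy in both directions: audio -> text and text -> audio.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2


# Example with random embeddings for a batch of 8 audio-caption pairs.
loss = audio_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```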