A Whisper transformer for audio captioning trained with synthetic
captions and transfer learning
- URL: http://arxiv.org/abs/2305.09690v1
- Date: Mon, 15 May 2023 22:20:07 GMT
- Title: A Whisper transformer for audio captioning trained with synthetic
captions and transfer learning
- Authors: Marek Kadlčík, Adam Hájek, Jürgen Kieslich, Radosław Winiecki
- Abstract summary: We present our approach to audio captioning, focusing on the use of a pretrained speech-to-text Whisper model and pretraining on synthetic captions.
Our findings demonstrate the impact of different training strategies on the performance of the audio captioning model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of audio captioning has seen significant advancements in recent
years, driven by the availability of large-scale audio datasets and
advancements in deep learning techniques. In this technical report, we present
our approach to audio captioning, focusing on the use of a pretrained
speech-to-text Whisper model and pretraining on synthetic captions. We discuss
our training procedures and present our experiments' results, which include
model size variations, dataset mixtures, and other hyperparameters. Our
findings demonstrate the impact of different training strategies on the
performance of the audio captioning model. Our code and trained models are
publicly available on GitHub and Hugging Face Hub.
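As a rough illustration of the approach described in the abstract, the sketch below fine-tunes a pretrained Whisper checkpoint on a single audio-caption pair using the Hugging Face transformers API. The checkpoint name, the synthetic caption, and the one-example training step are placeholder assumptions for illustration; the authors' actual training code is the version they published on GitHub and the Hugging Face Hub.

```python
# Minimal sketch: fine-tuning a pretrained Whisper checkpoint on audio-caption pairs.
# Illustrative only; the checkpoint name, audio, and caption below are placeholders,
# not the authors' released training setup.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One training example: 16 kHz mono audio and its (synthetic) caption.
waveform = torch.randn(16000 * 10)   # placeholder 10-second clip
caption = "a dog barks while rain falls on a metal roof"

# Log-mel features for the encoder, caption token ids as decoder labels.
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(caption, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()              # cross-entropy over caption tokens
optimizer.step()

# Greedy decoding of a caption for the same clip.
model.eval()
with torch.no_grad():
    generated = model.generate(inputs.input_features, max_new_tokens=50)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```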
Related papers
- EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation [3.696171835644556]
We introduce EmotionCaps, an audio captioning dataset comprised of approximately 120,000 audio clips with paired synthetic descriptions enriched with soundscape emotion recognition information.
Our findings challenge current approaches to captioning and suggest new directions for developing and assessing captioning models.
arXiv Detail & Related papers (2024-10-15T19:57:37Z)
- Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling [0.36868085124383626]
We describe an end-to-end speech synthesis system that uses generative adversarial training.
We train our Vocoder for raw phoneme-to-audio conversion, using explicit phonetic, pitch and duration modeling.
arXiv Detail & Related papers (2023-10-14T18:15:51Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z)
- ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) method.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations [7.817685358710508]
We propose a system to project recordings and textual descriptions into a shared audio-caption space (a generic sketch of this shared-space idea appears after this list).
Our results show that the augmentation strategies used reduce overfitting and improve retrieval performance.
We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.
arXiv Detail & Related papers (2022-08-24T11:54:42Z) - Automated Audio Captioning: an Overview of Recent Progress and New
Challenges [56.98522404673527]
Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips.
We present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets.
arXiv Detail & Related papers (2022-05-12T08:36:35Z)
- Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval [28.57294189207084]
The goal of audio captioning is to translate input audio into its description using natural language.
The proposed method successfully uses a pre-trained language model for audio captioning.
The oracle performance of the pre-trained model-based caption generator was clearly better than that of the conventional method trained from scratch.
arXiv Detail & Related papers (2020-12-14T08:27:36Z)
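The shared audio-caption embedding space mentioned in the retrieval paper above is commonly trained with a symmetric contrastive objective. The sketch below is a generic illustration of that idea under placeholder assumptions (random stand-in encoder features, arbitrary dimensions, an InfoNCE-style loss); it is not the cited system's implementation.

```python
# Generic sketch of a shared audio-caption embedding space trained with a symmetric
# contrastive (InfoNCE-style) loss. The encoder outputs are random stand-ins, and all
# dimensions and names are illustrative assumptions, not the cited paper's system.
import torch
import torch.nn.functional as F

EMBED_DIM = 256
audio_proj = torch.nn.Linear(768, EMBED_DIM)   # projects audio-encoder features
text_proj = torch.nn.Linear(512, EMBED_DIM)    # projects text-encoder features

def contrastive_loss(audio_feats, text_feats, temperature=0.07):
    """Symmetric cross-entropy over the audio-text similarity matrix."""
    a = F.normalize(audio_proj(audio_feats), dim=-1)
    t = F.normalize(text_proj(text_feats), dim=-1)
    logits = a @ t.T / temperature                 # (batch, batch) cosine similarities
    targets = torch.arange(len(a))                 # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Placeholder batch: 8 recordings and their paired captions, already encoded upstream.
audio_feats = torch.randn(8, 768)
text_feats = torch.randn(8, 512)
loss = contrastive_loss(audio_feats, text_feats)
loss.backward()
print(float(loss))
```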
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information provided and is not responsible for any consequences of its use.