Automated Audio Captioning: an Overview of Recent Progress and New
Challenges
- URL: http://arxiv.org/abs/2205.05949v1
- Date: Thu, 12 May 2022 08:36:35 GMT
- Title: Automated Audio Captioning: an Overview of Recent Progress and New
Challenges
- Authors: Xinhao Mei, Xubo Liu, Mark D. Plumbley and Wenwu Wang
- Abstract summary: Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips.
We present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets.
- Score: 56.98522404673527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated audio captioning is a cross-modal translation task that aims to
generate natural language descriptions for given audio clips. This task has
received increasing attention with the release of freely available datasets in
recent years. The problem has been addressed predominantly with deep learning
techniques. Numerous approaches have been proposed, such as investigating
different neural network architectures, exploiting auxiliary information (e.g.,
keywords or sentence information) to guide caption generation, and employing
different training strategies, which have greatly facilitated the development
of this field. In this paper, we present a comprehensive review of the
published contributions in automated audio captioning, from a variety of
existing approaches to evaluation metrics and datasets. Moreover, we discuss
open challenges and envisage possible future research directions.
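Most systems surveyed in the paper follow an encoder-decoder pipeline: frame-level audio features are encoded into a clip representation, and a language decoder generates the caption token by token. The following is a minimal, untrained sketch of that pipeline; the vocabulary, dimensions, random weights, and pooling/decoding choices are all illustrative assumptions, not the method of any specific surveyed system.

```python
import numpy as np

# Toy encoder-decoder audio captioning sketch (illustrative only):
# random weights stand in for trained networks.
rng = np.random.default_rng(0)

VOCAB = ["<sos>", "<eos>", "a", "dog", "barks", "rain", "falls"]
D_AUDIO, D_HID = 64, 32  # feature / hidden sizes (arbitrary)

# "Encoder": mean-pool frame-level audio features into one clip embedding.
W_enc = rng.standard_normal((D_AUDIO, D_HID)) * 0.1

def encode(frames):
    """frames: (T, D_AUDIO) log-mel-like features -> (D_HID,) clip embedding."""
    return np.tanh(frames.mean(axis=0) @ W_enc)

# "Decoder": combine the clip embedding with the previous token's embedding
# and score the vocabulary at each step (a stand-in for an RNN/Transformer).
E_tok = rng.standard_normal((len(VOCAB), D_HID)) * 0.1
W_out = rng.standard_normal((D_HID, len(VOCAB))) * 0.1

def greedy_decode(clip_emb, max_len=5):
    tokens = ["<sos>"]
    for _ in range(max_len):
        prev = E_tok[VOCAB.index(tokens[-1])]
        logits = np.tanh(clip_emb + prev) @ W_out
        nxt = VOCAB[int(np.argmax(logits))]
        tokens.append(nxt)
        if nxt == "<eos>":
            break
    return tokens[1:]

clip = rng.standard_normal((100, D_AUDIO))  # 100 fake audio frames
caption = greedy_decode(encode(clip))
print(caption)
```

In real systems the encoder is typically a CNN or audio Transformer trained on labelled audio, and the decoder is trained with cross-entropy against reference captions; auxiliary keyword or sentence information, where used, is injected as extra conditioning alongside the clip embedding.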
Related papers
- AnnoTheia: A Semi-Automatic Annotation Toolkit for Audio-Visual Speech
Technologies [0.0]
We present AnnoTheia, a semi-automatic annotation toolkit that detects when a person speaks on the scene and the corresponding transcription.
To show the complete process of preparing AnnoTheia for a language of interest, we also describe the adaptation of a pre-trained model for active speaker detection to Spanish.
arXiv Detail & Related papers (2024-02-20T17:07:08Z)
- A Whisper transformer for audio captioning trained with synthetic captions and transfer learning [0.0]
We present our approach to audio captioning, focusing on the use of a pretrained speech-to-text Whisper model and pretraining on synthetic captions.
Our findings demonstrate the impact of different training strategies on the performance of the audio captioning model.
arXiv Detail & Related papers (2023-05-15T22:20:07Z)
- Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations [7.817685358710508]
We propose a system to project recordings and textual descriptions into a shared audio-caption space.
Our results show that the used augmentations strategies reduce overfitting and improve retrieval performance.
We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.
arXiv Detail & Related papers (2022-08-24T11:54:42Z)
- Deep Learning for Visual Speech Analysis: A Survey [54.53032361204449]
This paper presents a review of recent progress in deep learning methods on visual speech analysis.
We cover different aspects of visual speech, including fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance.
arXiv Detail & Related papers (2022-05-22T14:44:53Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not yet converged on a definitive solution.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.