A Survey of Deep Learning Audio Generation Methods
- URL: http://arxiv.org/abs/2406.00146v1
- Date: Fri, 31 May 2024 19:20:27 GMT
- Title: A Survey of Deep Learning Audio Generation Methods
- Authors: Matej Božić, Marko Horvat
- Abstract summary: This article presents a review of typical techniques used in three distinct aspects of deep learning model development for audio generation.
In the first part, we provide an explanation of audio representations, beginning with the fundamental audio waveform.
We then progress to the frequency domain, with an emphasis on the attributes of human hearing, and finally introduce a relatively recent development.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This article presents a review of typical techniques used in three distinct aspects of deep learning model development for audio generation. In the first part of the article, we provide an explanation of audio representations, beginning with the fundamental audio waveform. We then progress to the frequency domain, with an emphasis on the attributes of human hearing, and finally introduce a relatively recent development. The main part of the article focuses on explaining basic and extended deep learning architecture variants, along with their practical applications in the field of audio generation. The following architectures are addressed: 1) Autoencoders 2) Generative adversarial networks 3) Normalizing flows 4) Transformer networks 5) Diffusion models. Lastly, we will examine four distinct evaluation metrics that are commonly employed in audio generation. This article aims to offer novice readers and beginners in the field a comprehensive understanding of the current state of the art in audio generation methods as well as relevant studies that can be explored for future research.
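The abstract's "emphasis on the attributes of human hearing" refers to frequency-domain representations built on perceptual scales such as the mel scale, which spaces frequency bins the way human pitch perception does. As a minimal illustration (using the common O'Shaughnessy formula; the surveyed paper may use a different variant), the Hz-to-mel mapping and its inverse can be sketched as:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping: mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The mapping is roughly linear below ~1 kHz and logarithmic above,
# mirroring how human hearing compresses differences at high frequencies.
for f in (440.0, 1000.0, 8000.0):
    print(f"{f:7.1f} Hz -> {hz_to_mel(f):8.2f} mel")
```

A mel spectrogram, one of the representations such surveys typically cover, is obtained by projecting a magnitude spectrogram onto triangular filters spaced evenly on this scale.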
Related papers
- Learning Audio Concepts from Counterfactual Natural Language
This study introduces causal reasoning and counterfactual analysis in the audio domain.
Our model considers acoustic characteristics and sound source information from human-annotated reference texts.
Specifically, the top-1 accuracy in open-ended language-based audio retrieval task increased by more than 43%.
arXiv Detail & Related papers (2024-01-10T05:15:09Z)
- From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications
Recent advancements in deep learning and computer vision have led to a surge of interest in generating realistic talking heads.
We systematically categorise them into four main approaches: image-driven, audio-driven, video-driven and others.
We provide an in-depth analysis of each method, highlighting their unique contributions, strengths, and limitations.
arXiv Detail & Related papers (2023-08-30T14:00:48Z)
- Leveraging Pre-trained AudioLDM for Sound Generation: A Benchmark Study
We make the first attempt to investigate the benefits of pre-training on sound generation with AudioLDM.
Our study demonstrates the advantages of the pre-trained AudioLDM, especially in data-scarcity scenarios.
We benchmark the sound generation task on various frequently-used datasets.
arXiv Detail & Related papers (2023-03-07T12:49:45Z)
- Learning in Audio-visual Context: A Review, Analysis, and New Perspective
This survey aims to systematically organize and analyze studies of the audio-visual field.
We introduce several key findings that have inspired our computational studies.
We propose a new perspective on audio-visual scene understanding, then discuss and analyze the feasible future direction of the audio-visual learning area.
arXiv Detail & Related papers (2022-08-20T02:15:44Z)
- BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets.
We present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features.
All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
arXiv Detail & Related papers (2022-06-24T02:26:40Z)
- Automated Audio Captioning: an Overview of Recent Progress and New Challenges
Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips.
We present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets.
arXiv Detail & Related papers (2022-05-12T08:36:35Z)
- WaveFake: A Data Set to Facilitate Audio Deepfake Detection
This paper provides an introduction to signal processing techniques used for analyzing audio signals.
Second, we present a novel data set, for which we collected nine sample sets from five different network architectures, spanning two languages.
Third, we supply practitioners with two baseline models, adopted from the signal processing community, to facilitate further research in this area.
arXiv Detail & Related papers (2021-11-04T12:26:34Z)
- A Review of Sound Source Localization with Deep Learning Methods
This article is a review on deep learning methods for single and multiple sound source localization.
We provide an exhaustive topography of the neural-based localization literature in this context.
Tables summarizing the literature review are provided at the end of the review for a quick search of methods with a given set of target characteristics.
arXiv Detail & Related papers (2021-09-08T07:25:39Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance on both tasks.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Deep Audio-Visual Learning: A Survey
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
- AudioMNIST: Exploring Explainable Artificial Intelligence for Audio Analysis on a Simple Benchmark
This paper explores post-hoc explanations for deep neural networks in the audio domain.
We present a novel Open Source audio dataset consisting of 30,000 audio samples of English spoken digits.
We demonstrate the superior interpretability of audible explanations over visual ones in a human user study.
arXiv Detail & Related papers (2018-07-09T23:11:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.