Related papers: ETTA: Elucidating the Design Space of Text-to-Audio Models

ETTA: Elucidating the Design Space of Text-to-Audio Models

URL: http://arxiv.org/abs/2412.19351v1
Date: Thu, 26 Dec 2024 21:13:12 GMT
Title: ETTA: Elucidating the Design Space of Text-to-Audio Models
Authors: Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro,
Abstract summary: We study the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks.<n>We propose our best model dubbed Elucidated Text-To-Audio (ETTA)<n>ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data.
Score: 33.831803213869605
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions -- a task that is more challenging than current benchmarks.

Related papers

Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos [3.2472293599354596]
This report presents our systems submitted to the audio-only and audio-visual tracks of the DCASE2025 Task 3 Challenge: Stereo Sound Event localization and Detection in Regular Video Content.<n>SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions.<n>To address this, we enhance standard SELD architectures with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs.
arXiv Detail & Related papers (2025-07-07T10:08:57Z)
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation [32.05517352929237]
State-of-the-art (SOTA) systems differ significantly across many dimensions, such as training datasets, modeling paradigms, and architectural choices.<n>This study focuses exclusively on the modeling paradigm.<n>We compare the two arguably most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching.
arXiv Detail & Related papers (2025-06-10T08:37:45Z)
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.<n>These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.<n>We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
We introduce textbfNexus-O, an industry-level textbfomni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data. We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities? Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios? Third, what strategies can be employed to curate and obtain high-quality, real-life scenario
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems. To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder. Results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models [30.68516200579894]
We introduce CM-TTS, a novel architecture grounded in consistency models (CMs) CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies. We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations.
arXiv Detail & Related papers (2024-03-31T05:38:08Z)
Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities. RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions. Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs. We employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
FALL-E: A Foley Sound Synthesis Model and Strategies [0.5599792629509229]
The FALL-E model employs a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder. We conditioned the model with dataset-specific texts, enabling it to learn sound quality and recording environment based on text input.
arXiv Detail & Related papers (2023-06-16T12:44:10Z)
Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills. Recent developments have enabled the use of more naturalistic training data for computational models. It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.