AudioGenX: Explainability on Text-to-Audio Generative Models
- URL: http://arxiv.org/abs/2502.00459v2
- Date: Tue, 04 Feb 2025 04:00:01 GMT
- Title: AudioGenX: Explainability on Text-to-Audio Generative Models
- Authors: Hyunju Kang, Geonhee Han, Yoonjae Jeong, Hogun Park
- Abstract summary: We introduce AudioGenX, an Explainable AI (XAI) method that provides explanations for text-to-audio generation models by highlighting the importance of input tokens.
This method offers a detailed and comprehensive understanding of the relationship between text inputs and audio outputs.
- Score: 2.9873893715462185
- Abstract: Text-to-audio generation models (TAG) have achieved significant advances in generating audio conditioned on text descriptions. However, a critical challenge lies in the lack of transparency regarding how each textual input impacts the generated audio. To address this issue, we introduce AudioGenX, an Explainable AI (XAI) method that provides explanations for text-to-audio generation models by highlighting the importance of input tokens. AudioGenX optimizes an Explainer by leveraging factual and counterfactual objective functions to provide faithful explanations at the audio token level. This method offers a detailed and comprehensive understanding of the relationship between text inputs and audio outputs, enhancing both the explainability and trustworthiness of TAG models. Extensive experiments demonstrate the effectiveness of AudioGenX in producing faithful explanations, benchmarked against existing methods using novel evaluation metrics specifically designed for audio generation tasks.
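To make the factual and counterfactual objectives concrete, here is a minimal PyTorch sketch of one plausible formulation, assuming a frozen TAG model `tag_model` that maps (masked) text embeddings to logits over discrete audio tokens; the mask parameterization, the negated counterfactual term, and the sparsity weight are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def explainer_losses(tag_model, text_emb, audio_tokens, mask_logits):
    """One plausible factual/counterfactual objective over a soft token mask.

    tag_model   : frozen text-to-audio model, (seq, d) embeddings -> (T, vocab) logits
    text_emb    : (seq, d) embeddings of the input text tokens
    audio_tokens: (T,) discrete audio tokens of the originally generated audio
    mask_logits : (seq,) learnable logits of the Explainer's importance mask
    """
    mask = torch.sigmoid(mask_logits)  # per-token importance in [0, 1]

    # Factual: keeping only the important tokens should still yield the same audio.
    factual_logits = tag_model(text_emb * mask.unsqueeze(-1))
    loss_factual = F.cross_entropy(factual_logits, audio_tokens)

    # Counterfactual: removing the important tokens should fail to yield it
    # (a simplified negated-likelihood form; the paper may use a different one).
    counter_logits = tag_model(text_emb * (1.0 - mask).unsqueeze(-1))
    loss_counter = -F.cross_entropy(counter_logits, audio_tokens)

    # Sparsity keeps the explanation focused on a few tokens.
    return loss_factual + loss_counter + 0.1 * mask.mean()
```

Optimizing `mask_logits` under such a combined loss yields per-token importance scores at the audio-token level, which is the kind of explanation the abstract describes.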
Related papers
- Synthetic Audio Helps for Cognitive State Tasks [5.372301053935417]
We show that text-to-speech models learn to track aspects of cognitive state in order to produce naturalistic audio.
We present Synthetic Audio Data fine-tuning (SAD), a framework in which seven tasks related to cognitive state modeling benefit from multimodal training.
arXiv Detail & Related papers (2025-02-10T17:16:24Z) - ADIFF: Explaining audio difference using natural language [31.963783032080993]
This paper comprehensively studies the task of explaining audio differences and then proposes a benchmark and baselines for the task.
We present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets.
We propose ADIFF, which introduces a cross-projection module, position captioning, and a three-step training process to enhance the model's ability to produce detailed explanations.
arXiv Detail & Related papers (2025-02-06T20:00:43Z) - Improving Text-To-Audio Models with Synthetic Captions [51.19111942748637]
We propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale.
We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions.
arXiv Detail & Related papers (2024-06-18T00:02:15Z) - Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important part of content-creation processes in the music and film industry.
Our hypothesis is that focusing on such aspects of audio generation could improve audio generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
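For reference, the standard direct preference optimization (DPO) loss that such winner/loser pairs typically feed into looks like the sketch below; this is the generic formulation, not necessarily the diffusion-specific variant Tango 2 uses:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO objective over winner/loser log-likelihoods.

    logp_w, logp_l        : log-likelihoods of winner/loser audio under the model
    ref_logp_w, ref_logp_l: the same quantities under a frozen reference model
    """
    # Reward margin of the winner over the loser, relative to the reference model
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```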
arXiv Detail & Related papers (2024-04-15T17:31:22Z) - Voice Attribute Editing with Text Prompt [48.48628304530097]
This paper introduces a novel task: voice attribute editing with text prompt.
The goal is to make relative modifications to voice attributes according to the actions described in the text prompt.
To solve this task, VoxEditor, an end-to-end generative model, is proposed.
arXiv Detail & Related papers (2024-04-13T00:07:40Z) - On The Open Prompt Challenge In Conditional Audio Generation [25.178010153697976]
Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text.
We treat TTA models as a "blackbox" and address the user prompt challenge with two key insights.
We propose utilizing text-audio alignment as feedback signals via margin ranking learning for audio improvements.
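A minimal sketch of that margin ranking idea, assuming a pretrained text-audio alignment scorer (e.g. CLAP-style cosine similarity) produces the scores; the names and margin value here are illustrative:

```python
import torch

def alignment_ranking_loss(sim_better, sim_worse, margin=0.1):
    """Hinge-style margin ranking on text-audio alignment scores.

    sim_better / sim_worse: alignment scores for a pair of generated audios,
    where sim_better is expected to rank above sim_worse.
    """
    # Penalize whenever the better sample fails to beat the worse by the margin
    return torch.clamp(margin - (sim_better - sim_worse), min=0.0).mean()
```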
arXiv Detail & Related papers (2023-11-01T23:33:25Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - Enhance audio generation controllability through representation similarity regularization [23.320569279485472]
We propose an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training.
Our proposed methods lead to improvements in objective metrics for both audio and music generation, as well as improvements in human-perceived quality for audio generation.
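One simple way to realize such a representation-similarity term, sketched under the assumption of paired audio/text encoder outputs (the weight and cosine form are illustrative):

```python
import torch.nn.functional as F

def similarity_regularizer(audio_emb, text_emb, weight=0.5):
    """Auxiliary loss term penalizing audio-text embedding misalignment."""
    audio_emb = F.normalize(audio_emb, dim=-1)  # unit-normalize feature dim
    text_emb = F.normalize(text_emb, dim=-1)
    # 1 - cosine similarity, averaged over the batch of paired embeddings
    return weight * (1.0 - (audio_emb * text_emb).sum(dim=-1)).mean()
```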
arXiv Detail & Related papers (2023-09-15T21:32:20Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
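As a generic illustration of the auto-regressive setup (not AudioGen's actual architecture or API), greedy decoding of discrete audio tokens conditioned on text might look like this; `model` and its call signature are hypothetical:

```python
import torch

@torch.no_grad()
def sample_audio_tokens(model, text_emb, num_steps, bos_id=0):
    """Greedy auto-regressive decoding of discrete audio tokens given text."""
    tokens = torch.full((1, 1), bos_id, dtype=torch.long)  # start-of-sequence
    for _ in range(num_steps):
        logits = model(tokens, text_emb)            # (1, t, vocab) logits
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=-1)
    return tokens[:, 1:]  # drop BOS; decode to waveform downstream
```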
arXiv Detail & Related papers (2022-09-30T10:17:05Z)