Related papers: BATON: Aligning Text-to-Audio Model with Human Preference Feedback

Related papers

Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens [62.56027815951259]
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens.<n>This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale.
arXiv Detail & Related papers (2026-02-18T18:32:46Z)
Aligning Audio Captions with Human Preferences [7.207308857839664]
We propose a preference-aligned audio captioning framework based on Reinforcement Learning from Human Feedback.<n>We train a Contrastive Language-Audio Pretraining (CLAP)-based reward model using human-labeled pairwise preference data.<n>Our framework achieves performance comparable to supervised approaches with ground-truth data, demonstrating its effectiveness in aligning audio captioning with human preferences.
arXiv Detail & Related papers (2025-09-18T06:33:44Z)
DreamAudio: Customized Text-to-Audio Generation with Diffusion Models [38.963121219471354]
We present DreamAudio for customized text-to-audio generation (CTTA)<n>We introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation.<n>Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events.
arXiv Detail & Related papers (2025-09-07T12:06:21Z)
Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings [32.813673146878685]
We present a system for automatic multi-axis perceptual quality prediction of generative audio.<n>The task is to predict four Audio Aesthetic Scores--Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness--for audio generated by text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) systems.
arXiv Detail & Related papers (2025-09-03T13:19:56Z)
In-the-wild Audio Spatialization with Flexible Text-guided Localization [37.60344400859993]
To enhance immersive experiences, audio offers spatial awareness of sounding objects in AR, VR, and embodied AI applications.<n>While existing audio spatialization methods can generally map any available monaural audio to audio signals, they often lack the flexible and interactive control needed in complex multi-object user-interactive environments.<n>We propose a Text-guided Audio Spatialization (TAS) framework that utilizes flexible text prompts and evaluates our model from unified generation and comprehension perspectives.
arXiv Detail & Related papers (2025-06-01T09:41:56Z)
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.<n>These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.<n>We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
We introduce textbfNexus-O, an industry-level textbfomni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data. We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities? Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios? Third, what strategies can be employed to curate and obtain high-quality, real-life scenario
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
ETTA: Elucidating the Design Space of Text-to-Audio Models [33.831803213869605]
We study the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks. We propose our best model dubbed Elucidated Text-To-Audio (ETTA) ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data.
arXiv Detail & Related papers (2024-12-26T21:13:12Z)
C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together. C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities. Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of such processes in the music and film industry. Our hypothesis is focusing on how these aspects of audio generation could improve audio generation performance in the presence of limited data. We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format. Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
Text-Driven Foley Sound Generation With Latent Diffusion Model [33.4636070590045]
Foley sound generation aims to synthesise the background sound for multimedia content. We propose a diffusion model based system for Foley sound generation with text conditions.
arXiv Detail & Related papers (2023-06-17T14:16:24Z)
FALL-E: A Foley Sound Synthesis Model and Strategies [0.5599792629509229]
The FALL-E model employs a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder. We conditioned the model with dataset-specific texts, enabling it to learn sound quality and recording environment based on text input.
arXiv Detail & Related papers (2023-06-16T12:44:10Z)
CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval [11.161404854726348]
We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval. We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text.
arXiv Detail & Related papers (2022-10-06T11:45:14Z)
Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios. The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement [31.33429812278942]
The proposed end-to-end speech synthesis model uses both speaker embedding and noise representation as conditional inputs to model speaker and noise information respectively. Experimental results show that the speech generated by the proposed approach has better subjective evaluation results than the method directly fine-tuning multi-speaker speech synthesis model.
arXiv Detail & Related papers (2020-05-26T06:14:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.