Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation
- URL: http://arxiv.org/abs/2512.02523v1
- Date: Tue, 02 Dec 2025 08:32:09 GMT
- Title: Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation
- Authors: Xueyan Li, Yuxin Wang, Mengjie Jiang, Qingzi Zhu, Jiang Zhang, Zoey Kim, Yazhe Niu,
- Abstract summary: We propose a generative feedback framework that provides multi-dimensional language and audio feedback for singing voice synthesis assessment.<n>Our approach leverages an audio-language model to generate text and audio critiques-covering aspects such as melody, content, and auditory quality.<n>The framework produces musically accurate and interpretable evaluations suitable for guiding generative model improvement.
- Score: 8.659397003532488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Singing voice synthesis (SVS) has advanced significantly, enabling models to generate vocals with accurate pitch and consistent style. As these capabilities improve, the need for reliable evaluation and optimization becomes increasingly critical. However, current methods like reward systems often rely on single numerical scores, struggle to capture various dimensions such as phrasing or expressiveness, and require costly annotations, limiting interpretability and generalization. To address these issues, we propose a generative feedback (i.e., reward model) framework that provides multi-dimensional language and audio feedback for SVS assessment. Our approach leverages an audio-language model to generate text and audio critiques-covering aspects such as melody, content, and auditory quality. The model is fine-tuned on a hybrid dataset combining human music reactions and synthetic critiques from a MLLMs, enhancing diversity and linguistic richness. Quantitative experiments validate the effectiveness of the proposed dataset and training strategy, demonstrating that the framework produces musically accurate and interpretable evaluations suitable for guiding generative model improvement. The code is at [https://github.com/opendilab/VocalCritic](https://github.com/opendilab/VocalCritic)
Related papers
- YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance [16.462715982402884]
Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment.<n>We propose a melody-driven SVS framework capable of synthesizing arbitrary lyrics following any reference melody.<n>Our method builds on a Diffusion Transformer (DiT) architecture, enhanced with a dedicated melody extraction module.
arXiv Detail & Related papers (2025-12-04T13:25:33Z) - AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges.<n>We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking.<n>We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on
arXiv Detail & Related papers (2025-07-17T00:39:18Z) - LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning [4.573044937555209]
We propose LAPS-Diff, a diffusion model integrated with language-aware embeddings and a vocal-style guided learning mechanism.<n>We curate a Hindi SVS dataset and leverage pre-trained language models to extract word and phone-level embeddings for an enriched lyrics representation.<n>We demonstrate that LAPS-Diff significantly improves the quality of the generated samples compared to the considered state-of-the-art (SOTA) model for our constrained dataset.
arXiv Detail & Related papers (2025-07-07T13:09:36Z) - SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture [3.7937714754535503]
SmoothSinger is a conditional diffusion model designed to synthesize high quality and natural singing voices.<n>It refines low-quality synthesized audio directly in a unified framework, mitigating the degradation associated with two-stage pipelines.<n> Experiments on the Opencpop dataset, a large-scale Chinese singing corpus, demonstrate that SmoothSinger achieves state-of-the-art results.
arXiv Detail & Related papers (2025-06-26T17:07:45Z) - ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [47.14083940177122]
ThinkSound is a novel framework that enables stepwise, interactive audio generation and editing for videos.<n>Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement, and targeted editing.<n> Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z) - Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound [46.7144966835279]
This paper addresses the need for automated systems capable of predicting audio aesthetics without human intervention.<n>We propose new annotation guidelines that decompose human listening perspectives into four distinct axes.<n>We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality.
arXiv Detail & Related papers (2025-02-07T18:15:57Z) - Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation [8.170174172545831]
This paper addresses issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events 2024.
We present an evaluation protocol combining objective metric, namely Fr'echet Audio Distance, with perceptual assessments, utilizing a structured prompt format to enable diverse captions and effective evaluation.
arXiv Detail & Related papers (2024-10-23T06:35:41Z) - Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language.<n>We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.<n>Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles.<n>StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.<n>Our evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z) - Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of the single-speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.