Related papers: S2Cap: A Benchmark and a Baseline for Singing Style Captioning

URL: http://arxiv.org/abs/2409.09866v2
Date: Sat, 15 Feb 2025 15:33:20 GMT
Title: S2Cap: A Benchmark and a Baseline for Singing Style Captioning
Authors: Hyunjong Ok, Jaeho Lee
Abstract summary: We introduce S2Cap, a singing voice dataset with comprehensive descriptions of diverse vocal, acoustic and demographic attributes. We develop a simple yet effective baseline algorithm for singing style captioning. Despite its simplicity, the proposed method outperforms state-of-the-art baselines.
Score: 12.515874333424929
Abstract: Singing voices contain much richer information than common voices, such as diverse vocal and acoustic characteristics. However, existing open-source audio-text datasets for singing voices capture only a limited set of attributes and lack acoustic features, which limits their utility for downstream tasks such as style captioning. To fill this gap, we formally consider the task of singing style captioning and introduce S2Cap, a singing voice dataset with comprehensive descriptions of diverse vocal, acoustic and demographic attributes. Based on this dataset, we develop a simple yet effective baseline algorithm for singing style captioning. The algorithm utilizes two novel technical components: CRESCENDO for mitigating misalignment between pretrained unimodal models, and demixing supervision to regularize the model to focus on the singing voice. Despite its simplicity, the proposed method outperforms state-of-the-art baselines.

Related papers

Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning. Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system. Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
arXiv Detail & Related papers (2025-01-03T18:09:26Z)
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control [58.96445085236971]
Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles. We introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles.
arXiv Detail & Related papers (2024-09-24T11:18:09Z)
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation. Experiments show that our model achieves favorable controllability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations [41.410556997285326]
Karaoker-SSL is a singing voice synthesis model that is trained only on text and speech data. It does not utilize any singing data end-to-end, since its vocoder is also trained on speech data.
arXiv Detail & Related papers (2024-02-02T16:06:24Z)
Singer Identity Representation Learning using Self-Supervised Techniques [0.0]
We propose a framework for training singer identity encoders to extract representations suitable for various singing-related tasks. We explore different self-supervised learning techniques on a large collection of isolated vocal tracks. We evaluate the quality of the resulting representations on singer similarity and identification tasks.
arXiv Detail & Related papers (2024-01-10T10:41:38Z)
StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles. StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. Our evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single speaker. It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice. Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations. We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.