StyleCap: Automatic Speaking-Style Captioning from Speech Based on
Speech and Language Self-supervised Learning Models
- URL: http://arxiv.org/abs/2311.16509v2
- Date: Wed, 27 Dec 2023 14:41:23 GMT
- Title: StyleCap: Automatic Speaking-Style Captioning from Speech Based on
Speech and Language Self-supervised Learning Models
- Authors: Kazuki Yamauchi, Yusuke Ijima, Yuki Saito
- Abstract summary: StyleCap is a method to generate natural language descriptions of speaking styles appearing in speech.
StyleCap is trained with paired data of speech and natural language descriptions.
- Score: 17.945821635380614
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We propose StyleCap, a method to generate natural language descriptions of
speaking styles appearing in speech. Although most conventional techniques for
para-/non-linguistic information recognition focus on category classification
or intensity estimation of pre-defined labels, they cannot explain the
reasoning behind the recognition result in an interpretable manner.
StyleCap is a first step towards an end-to-end method for generating
speaking-style prompts from speech, i.e., automatic speaking-style captioning.
StyleCap is trained with paired data of speech and natural language
descriptions. We train neural networks that convert a speech representation
vector into prefix vectors that are fed into a large language model (LLM)-based
text decoder. We explore an appropriate text decoder and speech feature
representation suitable for this new task. The experimental results demonstrate
that StyleCap, leveraging richer LLMs for the text decoder, speech
self-supervised learning (SSL) features, and sentence rephrasing augmentation,
improves the accuracy and diversity of the generated speaking-style captions.
Samples of speaking-style captions generated by our StyleCap are publicly
available.
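The architecture described in the abstract can be sketched roughly as follows: a small mapping network converts a speech representation vector (e.g., a mean-pooled speech SSL embedding) into a sequence of prefix embeddings, which are prepended to the text decoder's token embeddings and trained with a standard language-modeling loss on the paired caption. This is a minimal PyTorch-style sketch, not the authors' implementation; the GPT-2 decoder, prefix length, and layer sizes are assumptions for illustration.

```python
# Minimal sketch of the prefix-captioning idea: a mapping network turns a
# speech representation vector into prefix embeddings that condition an
# LLM-based text decoder. Prefix length and layer sizes are hypothetical.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class PrefixMapper(nn.Module):
    """Maps one speech embedding to `prefix_len` decoder input embeddings."""
    def __init__(self, speech_dim=1024, prefix_len=10, hidden_dim=768):
        super().__init__()
        self.prefix_len = prefix_len
        self.hidden_dim = hidden_dim
        self.net = nn.Sequential(
            nn.Linear(speech_dim, hidden_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(hidden_dim * prefix_len, hidden_dim * prefix_len),
        )

    def forward(self, speech_emb):                     # (B, speech_dim)
        out = self.net(speech_emb)                     # (B, prefix_len * hidden_dim)
        return out.view(-1, self.prefix_len, self.hidden_dim)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
decoder = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = PrefixMapper(speech_dim=1024, hidden_dim=decoder.config.n_embd)

# speech_emb would come from a speech SSL encoder (e.g., mean-pooled hidden states).
speech_emb = torch.randn(1, 1024)
prefix = mapper(speech_emb)                            # (1, prefix_len, n_embd)

caption = "A young woman speaks slowly in a calm, low-pitched voice."
tokens = tokenizer(caption, return_tensors="pt").input_ids
token_emb = decoder.transformer.wte(tokens)            # (1, T, n_embd)

# Prepend the prefix and compute the LM loss only on the caption tokens.
inputs_embeds = torch.cat([prefix, token_emb], dim=1)
labels = torch.cat(
    [torch.full((1, prefix.size(1)), -100), tokens], dim=1
)  # -100 masks the prefix positions out of the loss
loss = decoder(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()  # during training, the mapper (and optionally the decoder) is updated
```

At inference time, a caption would be generated autoregressively from the prefix embeddings alone, e.g., with beam search or sampling.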
Related papers
- DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [45.791472119671916]
Spoken language models (SLMs) process text and speech, enabling simultaneous speech understanding and generation.
DC-Spin aims to improve speech tokenization by bridging audio signals and SLM tokens.
We propose a chunk-wise approach that enables streamable DC-Spin without retraining or degradation.
arXiv Detail & Related papers (2024-10-31T17:43:13Z)
- Factor-Conditioned Speaking-Style Captioning [32.67274840212351]
We introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors and then generates the caption, so that the model explicitly learns speaking-style factors.
We also propose greedy-then-sampling (GtS) decoding, which first predicts the speaking-style factors deterministically to guarantee semantic accuracy and then samples the caption conditioned on them (a rough decoding sketch appears after this list).
arXiv Detail & Related papers (2024-06-27T05:52:10Z)
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
- StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis [63.019962126807116]
The expressive quality of synthesized speech for audiobooks is limited by a generalized model architecture and an unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z)
- Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data [2.6217304977339473]
We propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content.
Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model.
Experimental results show that our proposed method, combined with a diffusion-based generative model, achieves better speaker similarity in any-to-any voice conversion tasks.
arXiv Detail & Related papers (2023-09-06T05:33:54Z)
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
- ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora [37.53634609063878]
We propose a novel framework to generate Accurate and Diverse Stylized Captions (ADS-Cap).
A conditional variational auto-encoder is then used to automatically capture diverse stylistic patterns in the latent space.
Experimental results on two widely used stylized image captioning datasets show that ADS-Cap achieves outstanding performance in terms of consistency with the image, style accuracy, and diversity.
arXiv Detail & Related papers (2023-08-02T13:33:20Z)
- Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences [49.66987347397398]
Few-Shot Stylized Visual Captioning aims to generate captions in any desired style, using only a few examples as guidance during inference.
We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module.
arXiv Detail & Related papers (2023-07-31T04:26:01Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voices.
GenerSpeech decomposes speech variation into style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
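As referenced in the Factor-Conditioned Speaking-Style Captioning entry above, the greedy-then-sampling (GtS) idea can be illustrated with a small sketch: the factor phrase is decoded greedily for semantic accuracy, and the remainder of the caption is sampled for diversity. This is a rough, hypothetical illustration using a generic causal LM; the model, prompt, separator token, and generation settings are assumptions rather than details from that paper.

```python
# Rough sketch of greedy-then-sampling (GtS) decoding: decode the speaking-style
# factor phrase greedily, then sample the remaining caption for diversity.
# The separator string and generation settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

SEP = " | "  # hypothetical boundary between the factor phrase and the caption

def greedy_then_sampling(prompt, max_factor_tokens=20, max_caption_tokens=40):
    ids = tokenizer(prompt, return_tensors="pt").input_ids

    # Stage 1: greedy decoding of the factor phrase (stop at the separator).
    sep_id = tokenizer(SEP, add_special_tokens=False).input_ids[-1]
    factors = model.generate(
        ids,
        do_sample=False,                 # deterministic: stabilizes the factors
        max_new_tokens=max_factor_tokens,
        eos_token_id=sep_id,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Stage 2: sample the caption conditioned on the greedily decoded factors.
    caption = model.generate(
        factors,
        do_sample=True,                  # stochastic: encourages diverse wording
        top_p=0.9,
        temperature=1.0,
        max_new_tokens=max_caption_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(caption[0], skip_special_tokens=True)

print(greedy_then_sampling("Speaking style:"))
```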