Related papers: SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description

SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description

URL: http://arxiv.org/abs/2408.13608v1
Date: Sat, 24 Aug 2024 15:36:08 GMT
Title: SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description
Authors: Zeyu Jin, Jia Jia, Qixin Wang, Kehan Li, Shuoyi Zhou, Songtao Zhou, Xiaoyu Qin, Zhiyong Wu,
Abstract summary: We propose an automatic speech annotation system for interpretation that annotates in-the-wild speech clips with expressive and vivid human language descriptions. Our system provides in-depth understandings of speech style through tailored natural language descriptions. It is distinguished by highly descriptive natural language style prompts, containing approximately 2,000 hours of audio data and encompassing over two million speech clips.
Score: 19.064845530513285
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speech-language multi-modal learning presents a significant challenge due to the fine nuanced information inherent in speech styles. Therefore, a large-scale dataset providing elaborate comprehension of speech style is urgently needed to facilitate insightful interplay between speech audio and natural language. However, constructing such datasets presents a major trade-off between large-scale data collection and high-quality annotation. To tackle this challenge, we propose an automatic speech annotation system for expressiveness interpretation that annotates in-the-wild speech clips with expressive and vivid human language descriptions. Initially, speech audios are processed by a series of expert classifiers and captioning models to capture diverse speech characteristics, followed by a fine-tuned LLaMA for customized annotation generation. Unlike previous tag/templet-based annotation frameworks with limited information and diversity, our system provides in-depth understandings of speech style through tailored natural language descriptions, thereby enabling accurate and voluminous data generation for large model training. With this system, we create SpeechCraft, a fine-grained bilingual expressive speech dataset. It is distinguished by highly descriptive natural language style prompts, containing approximately 2,000 hours of audio data and encompassing over two million speech clips. Extensive experiments demonstrate that the proposed dataset significantly boosts speech-language task performance in stylist speech synthesis and speech style understanding.

Related papers

Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style [13.415189715216354]
We introduce the task of expressive speech retrieval.<n>The goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style.<n>We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space.
arXiv Detail & Related papers (2025-08-15T03:38:21Z)
S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information [47.950757976473035]
We introduce S2S-Arena, a novel arena-style S2S benchmark that evaluates instruction-following capabilities with paralinguistic information. In addition to the superior performance of GPT-4o, the speech model of cascaded ASR, LLM, and TTS outperforms the jointly trained model after text-speech alignment in speech2speech protocols.
arXiv Detail & Related papers (2025-03-07T02:07:00Z)
A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation [38.88908101517807]
Our research introduces a novel, carefully curated multilingual dataset from various movie audio tracks. Each dataset pair is precisely matched for paralinguistic information and duration. We enhance this by integrating multiple prosody transfer techniques, aiming for translations that are accurate, natural-sounding, and rich in paralinguistic details.
arXiv Detail & Related papers (2025-02-01T09:24:32Z)
Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach [14.5696754689252]
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations.
arXiv Detail & Related papers (2024-09-16T10:29:15Z)
DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities. Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks. These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive rich, speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage [7.096838107088313]
DisfluencySpeech is a studio-quality labeled English speech dataset with paralanguage. A single speaker recreates nearly 10 hours of expressive utterances from the Switchboard-1 Telephone Speech Corpus (Switchboard)
arXiv Detail & Related papers (2024-06-13T05:23:22Z)
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion. We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k hour dataset, which we use to train a speech language model. Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST)
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features [13.44542301438426]
We propose a direct speech-to-speech translation model which can be trained without any textual annotation or content information. Experiments on Mandarin-Cantonese speech translation demonstrate the feasibility of the proposed approach.
arXiv Detail & Related papers (2022-12-12T10:03:10Z)
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously. We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.