Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style
- URL: http://arxiv.org/abs/2508.11187v1
- Date: Fri, 15 Aug 2025 03:38:21 GMT
- Title: Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style
- Authors: Wonjune Kang, Deb Roy
- Abstract summary: We introduce the task of expressive speech retrieval. The goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space.
- Score: 13.415189715216354
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce the task of expressive speech retrieval, where the goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. While prior work has primarily focused on performing speech retrieval based on what was said in an utterance, we aim to do so based on how something was said. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space, which enables using free-form text prompts describing emotions or styles as queries to retrieve matching expressive speech segments. We perform detailed analyses of various aspects of our proposed framework, including encoder architectures, training criteria for effective cross-modal alignment, and prompt augmentation for improved generalization to arbitrary text queries. Experiments on multiple datasets encompassing 22 speaking styles demonstrate that our approach achieves strong retrieval performance as measured by Recall@k.
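To make the retrieval setup and the Recall@k metric in the abstract concrete, here is a minimal sketch in plain Python. It assumes that text queries and speech utterances have already been embedded into the joint space, that similarity is cosine similarity, and that a query counts as a hit when at least one of its top-k retrieved utterances carries the queried style label. The function names, the toy 2-D embeddings, and the style labels are hypothetical illustrations, not the authors' implementation or data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(query_embs, query_styles, speech_embs, speech_styles, k):
    """Fraction of text queries whose top-k retrieved utterances include the queried style.

    Assumption: a hit means at least one of the k most similar utterances
    has the same style label as the query.
    """
    hits = 0
    for query, style in zip(query_embs, query_styles):
        # Rank all utterances by similarity to the query, most similar first.
        ranked = sorted(
            range(len(speech_embs)),
            key=lambda i: cosine(query, speech_embs[i]),
            reverse=True,
        )
        if any(speech_styles[i] == style for i in ranked[:k]):
            hits += 1
    return hits / len(query_embs)


# Hypothetical toy data: four utterance embeddings and two text-query embeddings.
speech_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.6, 0.8]]
speech_styles = ["angry", "angry", "whisper", "whisper"]
query_embs = [[1.0, 0.2], [0.9, 0.3]]
query_styles = ["angry", "whisper"]

for k in (1, 3):
    score = recall_at_k(query_embs, query_styles, speech_embs, speech_styles, k)
    print(f"Recall@{k} = {score}")
```

In this toy example the second query sits closer to the "angry" utterances than to the "whisper" ones, so it only scores a hit once k is large enough to reach the first "whisper" utterance in its ranking.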
Related papers
- MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
- AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis [19.141058309358424]
This study proposes a text-to-speech (TTS) framework based on Retrieval-Augmented Generation (RAG) technology. We have constructed a speech style knowledge database containing high-quality speech samples in various contexts. This scheme uses embeddings, extracted by Llama, PER-LLM-Embedder, and Moka, to match with samples in the knowledge database, selecting the most appropriate speech style for synthesis.
arXiv Detail & Related papers (2025-04-14T15:18:59Z)
- InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training [23.330297074014315]
In this paper, we introduce a simple and scalable training method called InSerter, which stands for Interleaved Speech-Text Representation Pre-training. InSerter is designed to pre-train on large-scale unsupervised speech-text sequences, where the speech is synthesized from randomly selected segments of an extensive text corpus using text-to-speech conversion. Our proposed InSerter achieves SOTA performance in SpeechInstructBench and demonstrates superior or competitive results across diverse speech processing tasks.
arXiv Detail & Related papers (2025-03-04T16:34:14Z)
- SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description [19.064845530513285]
We propose an automatic speech annotation system for interpretation that annotates in-the-wild speech clips with expressive and vivid human language descriptions.
Our system provides an in-depth understanding of speech style through tailored natural language descriptions.
The resulting SpeechCraft dataset is distinguished by highly descriptive natural language style prompts; it contains approximately 2,000 hours of audio data and over two million speech clips.
arXiv Detail & Related papers (2024-08-24T15:36:08Z)
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
We propose the Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT).
The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics.
We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context.
Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z)
- Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment [54.8991472306962]
We propose Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA).
SPECTRA is the first-ever speech-text dialog pre-training model.
Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.
arXiv Detail & Related papers (2023-05-19T10:37:56Z)
- Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis [37.93814851450597]
We propose a hierarchical framework to model speaking style from context.
A hierarchical context encoder is proposed to explore a wider range of contextual information.
To encourage this encoder to learn style representation better, we introduce a novel training strategy.
arXiv Detail & Related papers (2022-03-23T05:27:57Z)
- Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
Research on learning spoken styles from historical conversations is still in its infancy.
Only the transcripts of the historical conversations are considered, which neglects the spoken styles in historical speeches.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.