Related papers: Seeing What You Say: Expressive Image Generation from Speech

URL: http://arxiv.org/abs/2511.03423v1
Date: Wed, 05 Nov 2025 12:40:28 GMT
Title: Seeing What You Say: Expressive Image Generation from Speech
Authors: Jiyoung Lee, Song Park, Sanghyuk Chun, Soo-Whan Chung
Abstract summary: VoxStudio generates expressive images directly from spoken descriptions by jointly aligning linguistic and paralinguistic information. By operating directly on semantic tokens, VoxStudio eliminates the need for an additional speech-to-text system. We also release VoxEmoset, a large-scale paired emotional speech-image dataset built via an advanced TTS engine.
Score: 39.6782945295833
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper proposes VoxStudio, the first unified and end-to-end speech-to-image model that generates expressive images directly from spoken descriptions by jointly aligning linguistic and paralinguistic information. At its core is a speech information bottleneck (SIB) module, which compresses raw speech into compact semantic tokens, preserving prosody and emotional nuance. By operating directly on these tokens, VoxStudio eliminates the need for an additional speech-to-text system, which often ignores the hidden details beyond text, e.g., tone or emotion. We also release VoxEmoset, a large-scale paired emotional speech-image dataset built via an advanced TTS engine to affordably generate richly expressive utterances. Comprehensive experiments on the SpokenCOCO, Flickr8kAudio, and VoxEmoset benchmarks demonstrate the feasibility of our method and highlight key challenges, including emotional consistency and linguistic ambiguity, paving the way for future research.

Related papers

MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion [49.55774551366049]
Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. We propose an EmotiveTalk framework to address these issues. Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation.
arXiv Detail & Related papers (2024-11-23T04:38:51Z)
DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage [7.096838107088313]
DisfluencySpeech is a studio-quality labeled English speech dataset with paralanguage. A single speaker recreates nearly 10 hours of expressive utterances from the Switchboard-1 Telephone Speech Corpus (Switchboard).
arXiv Detail & Related papers (2024-06-13T05:23:22Z)
StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations [12.891344121936902]
We introduce StoryTTS, a highly expressive text-to-speech (ETTS) dataset that is rich in expressiveness from both acoustic and textual perspectives. We analyze and define speech-related textual expressiveness in StoryTTS to include five distinct dimensions through linguistics, rhetoric, etc. The resulting corpus contains 61 hours of consecutive and highly prosodic speech equipped with accurate text transcriptions and rich textual expressiveness annotations.
arXiv Detail & Related papers (2024-04-23T11:41:35Z)
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis. This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
Contextual Expressive Text-to-Speech [25.050361896378533]
We introduce a new task setting, Contextual Text-to-speech (CTTS). The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text. We construct a synthetic dataset and develop an effective framework to generate high-quality expressive speech based on the given context.
arXiv Detail & Related papers (2022-11-26T12:06:21Z)
Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and their human-labeled emotion annotations. Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding. In the experiment phase, we first validate the effectiveness of our dataset with an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
