Emotion-Aware Speech Generation with Character-Specific Voices for Comics
- URL: http://arxiv.org/abs/2509.15253v1
- Date: Thu, 18 Sep 2025 05:49:57 GMT
- Title: Emotion-Aware Speech Generation with Character-Specific Voices for Comics
- Authors: Zhiwen Qian, Jinhua Liang, Huan Zhang
- Abstract summary: This paper presents an end-to-end pipeline for generating character-specific, emotion-aware speech from comics. The proposed system takes full comic volumes as input and produces speech aligned with each character's dialogue and emotional state.
- Score: 9.329714655190395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an end-to-end pipeline for generating character-specific, emotion-aware speech from comics. The proposed system takes full comic volumes as input and produces speech aligned with each character's dialogue and emotional state. An image processing module performs character detection, text recognition, and emotion intensity recognition. A large language model performs dialogue attribution and emotion analysis by integrating visual information with the evolving plot context. Speech is synthesized through a text-to-speech model with distinct voice profiles tailored to each character and emotion. This work enables automated voiceover generation for comics, offering a step toward interactive and immersive comic reading experience.
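The abstract's three stages — image processing, LLM-based dialogue attribution with emotion analysis, and character/emotion-conditioned TTS — can be sketched as a minimal pipeline. All class, function, and field names below are illustrative stand-ins, not the paper's released code.

```python
# Hypothetical sketch of the pipeline in the abstract:
# (1) image processing -> (2) LLM dialogue attribution + emotion analysis
# -> (3) TTS with per-character, per-emotion voice profiles.
from dataclasses import dataclass

@dataclass
class SpeechBubble:
    text: str
    panel: int

@dataclass
class Utterance:
    character: str
    text: str
    emotion: str

def image_processing(page: list[str]) -> list[SpeechBubble]:
    """Stand-in for character detection, text recognition, and
    emotion-intensity recognition on a comic page."""
    return [SpeechBubble(text=t, panel=i) for i, t in enumerate(page)]

def attribute_dialogue(bubbles: list[SpeechBubble], plot_context: list[str]) -> list[Utterance]:
    """Stand-in for the LLM that assigns each bubble to a speaker and an
    emotion, conditioned on the evolving plot context."""
    return [Utterance(character="hero", text=b.text, emotion="neutral") for b in bubbles]

def synthesize(u: Utterance, voice_profiles: dict) -> bytes:
    """Stand-in for the TTS model: look up the voice profile for this
    character and emotion, then render audio (here, a placeholder byte string)."""
    profile = voice_profiles[u.character][u.emotion]
    return f"{profile}:{u.text}".encode()

page = ["Stop right there!", "Never."]
bubbles = image_processing(page)
utterances = attribute_dialogue(bubbles, plot_context=[])
voices = {"hero": {"neutral": "voice-A"}}
audio = [synthesize(u, voices) for u in utterances]
```

The structure makes the key design choice visible: speaker identity and emotion are resolved upstream by the vision and language modules, so the TTS stage only needs a lookup into a fixed table of voice profiles.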
Related papers
- TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation [72.46711449668814]
We introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction, and speech quality.
arXiv Detail & Related papers (2025-12-23T12:04:23Z) - Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization [21.32336226752075]
Spoken DialogSum is the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary.
arXiv Detail & Related papers (2025-12-16T18:54:20Z) - OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction [123.89581506075461]
We propose OmniCharacter, the first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction. Our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms.
arXiv Detail & Related papers (2025-05-26T17:55:06Z) - Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts [20.281732318265483]
We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. We render each utterance as expressive, character-conditioned speech, resulting in fully-voiced, multimodal video narratives.
arXiv Detail & Related papers (2025-05-22T15:54:42Z) - MoCha: Towards Movie-Grade Talking Character Synthesis [62.007000023747445]
We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking head generation, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. We propose MoCha, the first of its kind to generate talking characters.
arXiv Detail & Related papers (2025-03-30T04:22:09Z) - Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - Toward accessible comics for blind and low vision readers [0.059584784039407875]
We propose to use existing computer vision and optical character recognition techniques to build a grounded context from the comic strip image content.
We generate a comic book script with context-aware panel descriptions, including each character's appearance, posture, mood, and dialogue.
arXiv Detail & Related papers (2024-07-11T07:50:25Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
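The three steps in this summary — decompose into discrete representations, translate the content units to the target emotion, then predict prosody and vocode — can be sketched with placeholder components. Every function here is an illustrative stand-in, not the paper's actual models.

```python
# Illustrative sketch of the textless emotion-conversion pipeline:
# decompose -> translate units -> predict prosody -> vocode.
def decompose(waveform):
    """Stand-in encoder: split speech into content units, F0,
    speaker identity, and emotion."""
    return {"units": [3, 3, 7], "f0": [120.0, 120.0, 118.0],
            "speaker": "spk0", "emotion": "neutral"}

def translate_units(units, target_emotion):
    """Stand-in for the unit-to-unit translation to the target emotion;
    a real model may insert or delete units (e.g. adding laughter)."""
    return list(units)

def predict_prosody(units, target_emotion):
    """Stand-in prosody predictor conditioned on the translated units."""
    base = 150.0 if target_emotion == "happy" else 110.0
    return [base for _ in units]

def vocode(units, f0, speaker):
    """Stand-in neural vocoder: turn the predicted discrete
    representations back into a waveform (here, placeholder samples)."""
    return [u * f for u, f in zip(units, f0)]

rep = decompose(waveform=None)
units = translate_units(rep["units"], "happy")
f0 = predict_prosody(units, "happy")
audio = vocode(units, f0, rep["speaker"])
```

The sketch highlights the decomposition idea: because content, pitch, speaker, and emotion are separate streams, each can be modified independently before resynthesis.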
arXiv Detail & Related papers (2021-11-14T18:16:42Z) - Emotional Prosody Control for Speech Generation [7.66200737962746]
We propose a text-to-speech (TTS) system, where a user can choose the emotion of generated speech from a continuous and meaningful emotion space.
The proposed TTS system can generate speech from the text in any speaker's style, with fine control of emotion.
arXiv Detail & Related papers (2021-11-07T08:52:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.