S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature
- URL: http://arxiv.org/abs/2603.00958v1
- Date: Sun, 01 Mar 2026 07:24:16 GMT
- Title: S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature
- Authors: Abigail Berthe-Pardo, Gaspard Michel, Elena V. Epure, Christophe Cerisara
- Abstract summary: We present S-VoCAL, the first dataset and evaluation framework dedicated to evaluating the inference of voice-related fictional character attributes. S-VoCAL comprises 8 attributes grounded in sociophonetic studies and 952 character-book pairs derived from Project Gutenberg. Our results suggest that the RAG pipeline reliably infers attributes such as Age or Gender, but struggles on others such as Origin or Physical Health.
- Score: 9.09344103114193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With recent advances in Text-to-Speech (TTS) systems, synthetic audiobook narration has seen increased interest and reached unprecedented levels of naturalness. However, large gaps remain in synthetic narration systems' ability to impersonate fictional characters and to convey complex emotions or prosody. A promising direction for enhancing character identification is the assignment of a plausible voice to each fictional character in a book. This step typically requires complex inference of attributes in book-length contexts, such as a character's age, gender, origin, or physical health, which in turn requires dedicated benchmark datasets to evaluate extraction systems' performance. We present S-VoCAL (Speaking Voice Character Attributes in Literature), the first dataset and evaluation framework dedicated to evaluating the inference of voice-related fictional character attributes. S-VoCAL comprises 8 attributes grounded in sociophonetic studies and 952 character-book pairs derived from Project Gutenberg. Its evaluation framework addresses the particularities of each attribute and includes a novel similarity metric based on recent Large Language Model embeddings. We demonstrate the applicability of S-VoCAL by applying a simple Retrieval-Augmented Generation (RAG) pipeline to the task of inferring character attributes. Our results suggest that the RAG pipeline reliably infers attributes such as Age or Gender, but struggles on others such as Origin or Physical Health. The dataset and evaluation code are available at https://github.com/AbigailBerthe/S-VoCAL.
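The abstract does not detail the RAG pipeline, so the following is only a minimal sketch of how book-length attribute inference could be wired up, assuming a sentence-transformers retriever over fixed-size chunks and a hypothetical `ask_llm` completion call; none of these names come from the paper.

```python
# Minimal sketch of a RAG pipeline for character-attribute inference.
# Assumptions: sentence-transformers for retrieval; `ask_llm` stands in
# for any chat-completion client and is NOT the authors' implementation.
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(book: str, size: int = 200) -> list[str]:
    """Split a book into fixed-size word windows."""
    words = book.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], model, chunk_emb, k: int = 5):
    """Return the k chunks most cosine-similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_emb @ q  # embeddings are normalized, so dot = cosine
    return [chunks[i] for i in np.argsort(-scores)[:k]]

def infer_attribute(book: str, character: str, attribute: str) -> str:
    """e.g. infer_attribute(book_text, "Elizabeth Bennet", "age")."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = chunk_text(book)
    chunk_emb = model.encode(chunks, normalize_embeddings=True)
    query = f"What is the {attribute} of {character}?"
    context = "\n".join(retrieve(query, chunks, model, chunk_emb))
    prompt = (f"Passages:\n{context}\n\n"
              f"Based only on these passages, state the {attribute} "
              f"of {character}. Answer briefly.")
    return ask_llm(prompt)  # hypothetical LLM call

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")
```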
Related papers
- AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering [97.52852990265136]
We introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.
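The abstract does not specify AQAScore's scoring rule; one generic way to turn audio question answering into an alignment score, shown purely as illustration, is to derive yes/no questions from the text prompt and average an audio-aware LM's "yes" probabilities.

```python
# Illustrative QA-based alignment scoring, NOT the actual AQAScore rule:
# derive yes/no questions from the text prompt, query an audio-aware LM
# about the generated audio, and average the probability of "yes".
def qa_alignment_score(audio_path: str, questions: list[str]) -> float:
    probs = [audio_lm_yes_probability(audio_path, q) for q in questions]
    return sum(probs) / len(probs)

def audio_lm_yes_probability(audio_path: str, question: str) -> float:
    """Hypothetical wrapper around an audio-aware LLM; should return
    P('yes' | audio, question)."""
    raise NotImplementedError

# Example: the prompt "a dog barks while rain falls" might yield
# questions = ["Is a dog barking?", "Is rain audible?"]
```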
arXiv Detail & Related papers (2026-01-21T07:35:36Z) - Dynamic Context Adaptation for Consistent Role-Playing Agents with Retrieval-Augmented Generations [1.1740676332544449]
We propose Amadeus, a training-free framework for role-playing agents (RPAs). The Amadeus framework can significantly enhance persona consistency even when responding to questions that lie beyond a character's knowledge.
arXiv Detail & Related papers (2025-08-04T03:27:05Z) - SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents [72.79816494079833]
Role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. We construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations.
arXiv Detail & Related papers (2025-08-04T03:18:36Z) - VANPY: Voice Analysis Framework [0.0]
We develop the VANPY framework for automated pre-processing, feature extraction, and classification of voice data. Four of the framework's components were developed in-house and integrated to extend speaker characterization capabilities. We demonstrate the framework's ability to extract speaker characteristics on a use-case challenge of analyzing character voices from the movie "Pulp Fiction".
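VANPY's actual API is not reproduced here; the sketch below only illustrates the pre-processing, feature-extraction, classification pipeline shape that the abstract describes, using librosa MFCCs and a scikit-learn classifier as stand-ins.

```python
# Generic preprocess -> feature-extraction -> classification pipeline in
# the spirit of VANPY; this is a sketch, not VANPY's actual API.
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(wav_path: str) -> np.ndarray:
    """Mean MFCCs as a crude fixed-length voice descriptor."""
    y, sr = librosa.load(wav_path, sr=16000, mono=True)  # pre-processing
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # feature extraction
    return mfcc.mean(axis=1)

def train_speaker_attribute_clf(wav_paths, labels):
    """Classification stage, e.g. predicting a speaker attribute."""
    X = np.stack([extract_features(p) for p in wav_paths])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```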
arXiv Detail & Related papers (2025-02-17T21:12:57Z) - BookWorm: A Dataset for Character Description and Analysis [59.186325346763184]
We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation.
We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses.
Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks.
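As a rough illustration of the hierarchical approach that retrieval is reported to beat, a map-reduce baseline might look like the sketch below; `summarize` stands in for any LLM call and this is not BookWorm's implementation. The retrieval-based alternative would resemble the RAG sketch given for S-VoCAL above.

```python
# Sketch of a hierarchical ("map-reduce") baseline for book-length
# character description; `summarize` is a hypothetical LLM call.
def hierarchical_description(chapters: list[str], character: str) -> str:
    # Map: describe the character from each chapter independently.
    notes = [summarize(f"Describe {character} in:\n{ch}") for ch in chapters]
    # Reduce: merge per-chapter notes into one factual profile.
    return summarize(f"Merge into one profile of {character}:\n"
                     + "\n".join(notes))

def summarize(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client")
```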
arXiv Detail & Related papers (2024-10-14T10:55:58Z) - Generating Visual Stories with Grounded and Coreferent Characters [63.07511918366848]
We present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions.<n>Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark.<n>We also propose new evaluation metrics to measure the richness of characters and coreference in stories.
arXiv Detail & Related papers (2024-09-20T14:56:33Z) - Improving Quotation Attribution with Fictional Character Embeddings [11.259583037191772]
We propose to augment a popular quotation attribution system, BookNLP, with character embeddings that encode global stylistic information of characters.
We show that combining BookNLP's contextual information with our proposed global character embeddings improves the identification of speakers for anaphoric and implicit quotes.
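The abstract does not give the fusion architecture; one minimal illustration of using global character embeddings for speaker identification is to score each candidate by the similarity between a contextual quote embedding and the candidate's global embedding. This is purely a sketch, not BookNLP's augmented model.

```python
# Illustrative candidate scoring for quotation attribution: pick the
# character whose global (stylistic) embedding best matches the quote's
# contextual embedding under cosine similarity.
import numpy as np

def score_candidates(quote_emb: np.ndarray,
                     char_embs: dict[str, np.ndarray]) -> str:
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(char_embs, key=lambda c: cos(quote_emb, char_embs[c]))
```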
arXiv Detail & Related papers (2024-06-17T09:46:35Z) - Learning Disentangled Speech Representations [0.412484724941528]
SynSpeech is a novel large-scale synthetic speech dataset designed to enable research on disentangled speech representations. We present a framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics. We find that SynSpeech facilitates benchmarking across a range of factors, achieving promising disentanglement of simpler features like gender and speaking style, while highlighting challenges in isolating complex attributes like speaker identity.
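Linear probing, as named in the abstract, is standard: freeze the learned speech representations and fit a linear classifier per generative factor, reading probe accuracy as linear decodability of that factor. A minimal scikit-learn version:

```python
# Linear probe: high held-out accuracy means the factor (e.g. gender or
# speaking style) is linearly decodable from the frozen representation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_accuracy(reps: np.ndarray, factor_labels: np.ndarray) -> float:
    X_tr, X_te, y_tr, y_te = train_test_split(
        reps, factor_labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```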
arXiv Detail & Related papers (2023-11-04T04:54:17Z) - NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Evaluations on the popular LJSpeech dataset show that the proposed NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level.
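CMOS (comparative mean opinion score) is the mean of listeners' side-by-side ratings of system audio against a reference, typically on a -3 (much worse) to +3 (much better) scale, so a value near 0 indicates parity. A toy computation:

```python
# CMOS as used above: average of comparative listener ratings; a score
# of roughly 0 means the system is indistinguishable from the reference.
def cmos(ratings: list[int]) -> float:
    assert all(-3 <= r <= 3 for r in ratings)
    return sum(ratings) / len(ratings)

# e.g. cmos([0, -1, 1, 0, 0]) == 0.0  -> on-par with human recordings
```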
arXiv Detail & Related papers (2022-05-09T16:57:35Z) - Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
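The pseudo-label strategy can be sketched generically: a text sentiment model labels ASR transcripts of unlabeled speech, and confident labels then supervise the end-to-end speech model. The confidence threshold and all names below are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of pseudo-label semi-supervised training for speech sentiment:
# `asr` and `text_sentiment_model` are caller-supplied callables and the
# 0.9 confidence filter is an assumption, not the paper's setting.
def pseudo_label_dataset(unlabeled_wavs, asr, text_sentiment_model,
                         threshold: float = 0.9):
    pseudo = []
    for wav in unlabeled_wavs:
        transcript = asr(wav)                        # speech -> text
        label, conf = text_sentiment_model(transcript)
        if conf >= threshold:                        # keep confident labels
            pseudo.append((wav, label))
    return pseudo  # mix with gold-labeled data to train the speech model
```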
arXiv Detail & Related papers (2021-06-11T20:15:21Z) - Deep Learning for Prominence Detection in Children's Read Speech [13.041607703862724]
We consider a labeled dataset of children's reading recordings for the speaker-independent detection of prominent words.
A previous well-tuned random forest ensemble predictor is replaced by an RNN sequence model to exploit potential context dependency.
Deep learning is applied to obtain word-level features from low-level acoustic contours of fundamental frequency, intensity and spectral shape.
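A minimal PyTorch sketch of such an RNN sequence tagger, assuming each word already carries a fixed-length feature vector pooled from the fundamental frequency, intensity, and spectral-shape contours; the paper's exact architecture may differ.

```python
# Word-level RNN tagger for prominence: a bidirectional GRU over
# per-word acoustic features, emitting one prominence logit per word.
import torch
import torch.nn as nn

class ProminenceTagger(nn.Module):
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # per-word prominence logit

    def forward(self, word_feats: torch.Tensor) -> torch.Tensor:
        # word_feats: (batch, n_words, feat_dim)
        out, _ = self.rnn(word_feats)
        return self.head(out).squeeze(-1)  # (batch, n_words) logits

# Train with nn.BCEWithLogitsLoss against binary prominence labels.
```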
arXiv Detail & Related papers (2021-04-12T14:15:08Z)