Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate
- URL: http://arxiv.org/abs/2511.10693v1
- Date: Wed, 12 Nov 2025 07:44:42 GMT
- Title: Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate
- Authors: Eyal Rabin, Zohar Elyoseph, Rotem Israel-Fishelson, Adi Dali, Ravit Nussinson
- Abstract summary: This study investigates whether state-of-the-art text-to-speech systems have the human tendency to reduce speech rate to convey politeness. We prompted 22 synthetic voices from two leading AI platforms to read a fixed script under both "polite and formal" and "casual and informal" conditions. Across both AI platforms, the polite prompt produced slower speech than the casual prompt, with very large effect sizes.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice-based artificial intelligence is increasingly expected to adhere to human social conventions, but can it learn implicit cues that are not explicitly programmed? This study investigates whether state-of-the-art text-to-speech systems have internalized the human tendency to reduce speech rate to convey politeness - a non-obvious prosodic marker. We prompted 22 synthetic voices from two leading AI platforms (AI Studio and OpenAI) to read a fixed script under both "polite and formal" and "casual and informal" conditions and measured the resulting speech duration. Across both AI platforms, the polite prompt produced slower speech than the casual prompt with very large effect sizes, an effect that was statistically significant for all of AI Studio's voices and for a large majority of OpenAI's voices. These results demonstrate that AI can implicitly learn and replicate psychological nuances of human communication, highlighting its emerging role as a social actor capable of reinforcing human social norms.
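To make the protocol concrete, below is a minimal sketch of the duration comparison, assuming the OpenAI Python SDK for synthesis; the model name, voices, and script are illustrative placeholders, not the paper's materials or analysis code.

```python
# A minimal sketch of the polite-vs-casual duration comparison, assuming
# the OpenAI Python SDK; model, voices, and script are illustrative.
import io
import numpy as np
import soundfile as sf
from scipy import stats
from openai import OpenAI

client = OpenAI()
SCRIPT = "Hello, thank you for calling. How can I help you today?"
STYLES = {"polite": "Speak in a polite and formal manner.",
          "casual": "Speak in a casual and informal manner."}
VOICES = ["alloy", "echo", "fable"]  # illustrative subset, not the paper's 22

def duration_sec(style: str, voice: str) -> float:
    """Synthesize SCRIPT under a style instruction; return clip length in seconds."""
    resp = client.audio.speech.create(
        model="gpt-4o-mini-tts", voice=voice, input=SCRIPT,
        instructions=STYLES[style], response_format="wav")
    audio, sr = sf.read(io.BytesIO(resp.read()))
    return len(audio) / sr

polite = np.array([duration_sec("polite", v) for v in VOICES])
casual = np.array([duration_sec("casual", v) for v in VOICES])
diff = polite - casual
t, p = stats.ttest_rel(polite, casual)   # paired t-test across voices
d = diff.mean() / diff.std(ddof=1)       # Cohen's d for paired samples
print(f"t={t:.2f}, p={p:.4f}, d={d:.2f}")
```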
Related papers
- Can You Tell It's AI? Human Perception of Synthetic Voices in Vishing Scenarios [3.2976205772213123]
Large Language Models and commercial speech synthesis systems now enable highly realistic AI-generated voice scams (vishing). Yet it remains unclear whether individuals can reliably distinguish AI-generated speech from human-recorded voices in realistic scam contexts. We conducted a controlled online study in which 22 participants evaluated 16 vishing-style audio clips and classified each as human or AI.
arXiv Detail & Related papers (2026-02-23T17:17:53Z)
- Unheard in the Digital Age: Rethinking AI Bias and Speech Diversity [0.0]
Speech remains one of the most visible yet overlooked vectors of inclusion and exclusion in contemporary society. This article focuses on the structural biases that shape perceptions of atypical speech and are now being encoded into artificial intelligence.
arXiv Detail & Related papers (2026-01-26T16:12:25Z)
- The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era [95.35748535806744]
We launch the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026. This paper summarizes the dataset, track configurations, and the final results.
arXiv Detail & Related papers (2026-01-09T06:32:30Z)
- Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play [21.93291433513335]
Voila achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models. Voila supports over one million pre-built voices and efficient customization of new ones from audio samples as short as 10 seconds.
arXiv Detail & Related papers (2025-05-05T15:05:01Z)
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
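A minimal sketch of such cross-attention fusion, assuming PyTorch; the module layout, dimensions, and EOU head are illustrative, not the paper's architecture:

```python
# A minimal sketch of acoustic-linguistic fusion via cross-attention,
# assuming PyTorch; dimensions and the EOU head are illustrative.
import torch
import torch.nn as nn

class EOUPredictor(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Linguistic tokens attend to acoustic frames.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.eou_head = nn.Linear(d_model, 1)  # per-step EOU probability

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor):
        # text_emb:  (batch, n_tokens, d_model)  partial-hypothesis embeddings
        # audio_emb: (batch, n_frames, d_model)  encoded acoustic frames
        fused, _ = self.cross_attn(query=text_emb, key=audio_emb, value=audio_emb)
        return torch.sigmoid(self.eou_head(fused)).squeeze(-1)

model = EOUPredictor()
probs = model(torch.randn(2, 10, 256), torch.randn(2, 50, 256))
print(probs.shape)  # (2, 10): EOU probability after each token
```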
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- MindSpeech: Continuous Imagined Speech Decoding using High-Density fNIRS and Prompt Tuning for Advanced Human-AI Interaction [0.0]
This paper reports a novel method for human-AI interaction by developing a direct brain-AI interface.
We discuss a novel AI model, called MindSpeech, which enables open-vocabulary, continuous decoding for imagined speech.
We demonstrate significant improvements in key metrics, such as BLEU-1 and BERTScore precision (BERT-P), for three out of four participants.
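For reference, BLEU-1 reduces BLEU to unigram precision; a minimal sketch with NLTK, where the decoded and reference sentences are invented placeholders rather than MindSpeech data:

```python
# A minimal sketch of BLEU-1 scoring with NLTK; the decoded/reference
# sentences are invented for illustration, not from the MindSpeech study.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["i", "want", "to", "drink", "water"]
decoded = ["i", "want", "some", "water"]

# weights=(1, 0, 0, 0) restricts BLEU to unigram precision (BLEU-1).
bleu1 = sentence_bleu([reference], decoded, weights=(1, 0, 0, 0),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-1 = {bleu1:.3f}")
```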
arXiv Detail & Related papers (2024-07-25T16:39:21Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora. We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
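A minimal sketch of the masked token-infilling objective, assuming discrete speech/text units and PyTorch; the vocabulary size, mask rate, and model sizes are illustrative:

```python
# A minimal sketch of masked token infilling over discrete units,
# assuming PyTorch; vocabulary size and mask rate are illustrative.
import torch
import torch.nn as nn

VOCAB, MASK_ID, MASK_RATE = 100, 0, 0.15

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=2,
)
embed = nn.Embedding(VOCAB, 128)
head = nn.Linear(128, VOCAB)

tokens = torch.randint(1, VOCAB, (8, 32))    # discrete speech or text units
mask = torch.rand(tokens.shape) < MASK_RATE  # positions to infill
inputs = tokens.masked_fill(mask, MASK_ID)

logits = head(encoder(embed(inputs)))
loss = nn.functional.cross_entropy(
    logits[mask], tokens[mask]               # predict only the masked units
)
loss.backward()
```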
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion [4.251500966181852]
The study's dataset consists of real human speech from eight well-known figures, along with each voice converted to the others using Retrieval-based Voice Conversion.
It is found that the Extreme Gradient Boosting model can achieve an average classification accuracy of 99.3% and can classify speech in real-time, at around 0.004 milliseconds given one second of speech.
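A minimal sketch of such a gradient-boosted detector, assuming xgboost and scikit-learn; the features and labels below are random placeholders, not the study's extracted audio features:

```python
# A minimal sketch of a gradient-boosted real-vs-synthetic speech
# classifier; features and labels are random placeholders.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))    # e.g., 40 per-clip audio features
y = rng.integers(0, 2, size=1000)  # 0 = human, 1 = AI-converted

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```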
arXiv Detail & Related papers (2023-08-24T12:26:15Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
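The residual vector quantization step can be sketched in a few lines; a minimal example assuming PyTorch, with random (unlearned) codebooks standing in for a trained codec:

```python
# A minimal sketch of residual vector quantization (RVQ): each stage
# quantizes the residual left by the previous one. Codebooks are random
# placeholders here; in a real codec they are learned.
import torch

def rvq(x: torch.Tensor, codebooks: list):
    """Quantize x (batch, dim) with a list of (K, dim) codebooks."""
    residual, codes, quantized = x, [], torch.zeros_like(x)
    for cb in codebooks:
        dists = torch.cdist(residual, cb)  # (batch, K) distances to codes
        idx = dists.argmin(dim=1)          # nearest code per vector
        chosen = cb[idx]
        codes.append(idx)
        quantized = quantized + chosen
        residual = residual - chosen       # next stage sees the residual
    return quantized, codes

x = torch.randn(4, 64)
books = [torch.randn(256, 64) for _ in range(8)]  # 8 stages, 256 codes each
x_hat, codes = rvq(x, books)
print((x - x_hat).norm() / x.norm())              # reconstruction error ratio
```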
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Can Machines Imitate Humans? Integrative Turing-like tests for Language and Vision Demonstrate a Narrowing Gap [56.611702960809644]
We benchmark AI's ability to imitate humans in three language tasks and three vision tasks. Next, we conducted 72,191 Turing-like tests with 1,916 human judges and 10 AI judges. Imitation ability showed minimal correlation with conventional AI performance metrics.
arXiv Detail & Related papers (2022-11-23T16:16:52Z)
- Speech Synthesis as Augmentation for Low-Resource ASR [7.2244067948447075]
Speech synthesis might hold the key to low-resource speech recognition.
Data augmentation techniques have become an essential part of modern speech recognition training.
Speech synthesis techniques have been rapidly getting closer to the goal of achieving human-like speech.
arXiv Detail & Related papers (2020-12-23T22:19:42Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
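A minimal sketch of the prosody-corrector idea, assuming PyTorch: map phoneme embeddings to per-phoneme duration and pitch targets. Layer sizes are illustrative, not the paper's configuration.

```python
# A minimal sketch of a prosody corrector: map phoneme embeddings to
# typical duration and pitch targets. Assumes PyTorch; sizes illustrative.
import torch
import torch.nn as nn

class ProsodyCorrector(nn.Module):
    def __init__(self, d_phoneme: int = 128, d_hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(d_phoneme, d_hidden, batch_first=True, bidirectional=True)
        self.duration_head = nn.Linear(2 * d_hidden, 1)  # frames per phoneme
        self.pitch_head = nn.Linear(2 * d_hidden, 1)     # log-F0 per phoneme

    def forward(self, phoneme_emb: torch.Tensor):
        h, _ = self.rnn(phoneme_emb)                     # (batch, T, 2*d_hidden)
        duration = torch.relu(self.duration_head(h)).squeeze(-1)
        pitch = self.pitch_head(h).squeeze(-1)
        return duration, pitch

model = ProsodyCorrector()
dur, f0 = model(torch.randn(2, 20, 128))  # 20 phonemes per utterance
print(dur.shape, f0.shape)                # (2, 20) each
```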
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics [0.0]
We propose an approach to distinguish human speech from AI-synthesized speech.
Higher-order statistics show less correlation for human speech than for synthesized speech.
Cepstral analysis also revealed a durable power component in human speech that is missing from synthesized speech.
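A minimal sketch of the cepstral side of that analysis; the harmonic test signal below is a placeholder for a voiced speech frame, not the study's data:

```python
# A minimal sketch of real-cepstrum computation for a speech frame;
# the harmonic test tone is a placeholder for actual voiced speech.
import numpy as np

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """c = IFFT(log|FFT(x)|): separates excitation and envelope energy."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-10)  # floor avoids log(0)
    return np.fft.ifft(log_mag).real

sr = 16000
t = np.arange(sr) / sr
# Harmonic-rich placeholder "voiced frame" with a 120 Hz fundamental.
frame = sum(np.sin(2 * np.pi * 120 * k * t) for k in range(1, 10))

ceps = real_cepstrum(frame)
lo = 50                                  # skip near-zero quefrencies
peak = lo + np.argmax(np.abs(ceps[lo:sr // 2]))
print(f"dominant quefrency: {peak / sr * 1000:.1f} ms (pitch period ~8.3 ms)")
```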
arXiv Detail & Related papers (2020-09-03T21:29:41Z)