What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels
- URL: http://arxiv.org/abs/2512.16832v1
- Date: Thu, 18 Dec 2025 18:10:20 GMT
- Title: What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels
- Authors: Aditya Yadavalli, Tiago Pimentel, Tamar I. Regev, Ethan Wilcox, Alex Warstadt
- Abstract summary: Prosody conveys critical information often not captured by the words or text of a message. We propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text.
- Score: 29.532302985753102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prosody -- the melody of speech -- conveys critical information often not captured by the words or text of a message. In this paper, we propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text, and crucially, what that information is about. Our approach applies large speech and language models to estimate the mutual information between a particular dimension of an utterance's meaning (e.g., its emotion) and any of its communication channels (e.g., audio or text). We then use this approach to quantify how much information is conveyed by audio and text about sarcasm, emotion, and questionhood, using speech from television and podcasts. We find that for sarcasm and emotion the audio channel -- and by implication the prosodic channel -- transmits over an order of magnitude more information about these features than the text channel alone, at least when long-term context beyond the current sentence is unavailable. For questionhood, prosody provides comparatively less additional information. We conclude by outlining a program applying our approach to more dimensions of meaning, communication channels, and languages.
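As a concrete illustration of the approach the abstract describes: the mutual information between a meaning dimension M (e.g., emotion) and a channel C (audio or text) decomposes as I(M; C) = H(M) - H(M | C), and the held-out cross-entropy of a probe that predicts M from C upper-bounds H(M | C), yielding a lower bound on I(M; C). The sketch below assumes per-utterance label distributions from hypothetical probes on a speech model and a text LM; it is an illustrative formulation, not the authors' exact pipeline.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline):
# estimate a lower bound on I(M; C) = H(M) - H(M | C), where M is a
# meaning label (e.g., emotion) and C is a channel (audio or text).
import math
from collections import Counter

def entropy_bits(labels):
    """Plug-in estimate of H(M) in bits from empirical label frequencies."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def cross_entropy_bits(probs, labels):
    """Held-out cross-entropy (bits per utterance) of a probe that maps a
    channel to a distribution over meaning labels; upper-bounds H(M | C)."""
    return -sum(math.log2(p[y]) for p, y in zip(probs, labels)) / len(labels)

def mi_lower_bound_bits(probs, labels):
    """Lower bound on I(M; C): H(M) minus the probe's cross-entropy."""
    return entropy_bits(labels) - cross_entropy_bits(probs, labels)

# Hypothetical usage: `audio_probs` / `text_probs` are per-utterance
# label distributions (dicts mapping label -> probability) produced by
# probes on a speech model and a text LM, respectively. Comparing the
# two bounds shows how much extra information audio carries about M.
# mi_audio = mi_lower_bound_bits(audio_probs, gold_labels)
# mi_text  = mi_lower_bound_bits(text_probs, gold_labels)
```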
Related papers
- Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation [65.7990140284317]
We focus on object grounding, i.e., localizing an object of interest in a visual scene based on verbal human instructions. To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods.
arXiv Detail & Related papers (2025-11-27T02:00:28Z)
- Listening Between the Lines: Decoding Podcast Narratives with Language Modeling [17.51119928424848]
We show that existing large language models, typically trained on more structured text such as news articles, struggle to capture subtle cues that human listeners rely on to identify narrative frames. Our approach then uses these granular frame labels to reveal broader discourse trends.
arXiv Detail & Related papers (2025-11-07T15:12:06Z)
- The time scale of redundancy between prosody and linguistic context [22.04241078302997]
We find that a word's prosodic features require an extended past context to be reliably predicted. We also find that a word's prosodic features show some redundancy with future words, but only with a short scale of 1-2 words.
arXiv Detail & Related papers (2025-03-14T17:48:23Z)
- Information Theory of Meaningful Communication [0.0]
In Shannon's seminal paper, the entropy of printed English, treated as a stationary process, was estimated to be roughly 1 bit per character.
In this study, we show that one can leverage recently developed large language models to quantify information communicated in meaningful narratives in terms of bits of meaning per clause.
arXiv Detail & Related papers (2024-11-19T18:51:23Z)
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words (see the sketch after this list).
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- It's not what you said, it's how you said it: discriminative perception of speech as a multichannel communication system [13.150821247850876]
People convey information extremely effectively through spoken interaction using the lexical channel of what is said, and the non-lexical channel of how it is said.
We propose studying human perception of spoken communication as a means to better understand how information is encoded across these channels.
We present a novel behavioural task testing whether listeners can discriminate between the true utterance in a dialogue and utterances sampled from other contexts with the same lexical content.
arXiv Detail & Related papers (2021-05-01T14:30:30Z)
- Paragraph-level Commonsense Transformers with Recurrent Memory [77.4133779538797]
We train a discourse-aware model that incorporates paragraph-level information to generate coherent commonsense inferences from narratives.
Our results show that PARA-COMET outperforms the sentence-level baselines, particularly in generating inferences that are both coherent and novel.
arXiv Detail & Related papers (2020-10-04T05:24:12Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
- Unsupervised Speech Decomposition via Triple Information Bottleneck [63.55007056410914]
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
We propose SpeechSplit, which can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks.
arXiv Detail & Related papers (2020-04-23T16:12:42Z)
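As referenced in the "Quantifying the redundancy between prosody and text" entry above, a minimal sketch of how such redundancy can be estimated follows. It assumes a Gaussian model of a scalar prosodic feature P and a linear ridge probe over text embeddings T, under which I(P; T) = H(P) - H(P | T) reduces to half the log ratio of unconditional to residual variance; the probe and the synthetic data are illustrative stand-ins for that paper's density estimators, not its exact method.

```python
# Minimal sketch (assumed Gaussian model and linear probe, not the
# paper's exact estimators): redundancy between a prosodic feature P
# (e.g., a word's mean pitch) and text context T as
# I(P; T) = H(P) - H(P | T) = 0.5 * log2(Var(P) / Var(residuals)).
import numpy as np
from sklearn.linear_model import Ridge

def gaussian_mi_bits(prosody, text_embeddings):
    """Estimate I(P; T) in bits under Gaussian assumptions. In practice a
    held-out split would be used; in-sample residuals overestimate MI."""
    probe = Ridge(alpha=1.0).fit(text_embeddings, prosody)
    residuals = prosody - probe.predict(text_embeddings)
    return 0.5 * np.log2(np.var(prosody) / np.var(residuals))

# Synthetic usage: text embeddings "explain" part of a pitch signal.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 16))              # stand-in for LM embeddings
pitch = emb @ rng.normal(size=16) + rng.normal(scale=2.0, size=1000)
print(f"estimated redundancy: ~{gaussian_mi_bits(pitch, emb):.2f} bits")
```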