It's not what you said, it's how you said it: discriminative perception
of speech as a multichannel communication system
- URL: http://arxiv.org/abs/2105.00260v1
- Date: Sat, 1 May 2021 14:30:30 GMT
- Title: It's not what you said, it's how you said it: discriminative perception
of speech as a multichannel communication system
- Authors: Sarenne Wallbridge, Peter Bell, Catherine Lai
- Abstract summary: People convey information extremely effectively through spoken interaction using the lexical channel of what is said, and the non-lexical channel of how it is said.
We propose studying human perception of spoken communication as a means to better understand how information is encoded across these channels.
We present a novel behavioural task testing whether listeners can discriminate between the true utterance in a dialogue and utterances sampled from other contexts with the same lexical content.
- Score: 13.150821247850876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: People convey information extremely effectively through spoken interaction
using multiple channels of information transmission: the lexical channel of
what is said, and the non-lexical channel of how it is said. We propose
studying human perception of spoken communication as a means to better
understand how information is encoded across these channels, focusing on the
question 'What characteristics of communicative context affect listeners'
expectations of speech?'. To investigate this, we present a novel behavioural
task testing whether listeners can discriminate between the true utterance in a
dialogue and utterances sampled from other contexts with the same lexical
content. We characterize how perception - and subsequent discriminative
capability - is affected by different degrees of additional contextual
information across both the lexical and non-lexical channel of speech. Results
demonstrate that people can effectively discriminate between different prosodic
realisations, that non-lexical context is informative, and that this channel
provides more salient information than the lexical channel, highlighting the
importance of the non-lexical channel in spoken interaction.
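The core behavioural measure here is whether listeners pick the true utterance above chance. As a minimal sketch of how such discrimination accuracy might be tested against chance, assuming a two-alternative forced-choice setup with illustrative (not reported) trial counts, an exact binomial test suffices:

```python
from math import comb

def binomial_p_above_chance(correct, trials, p_chance=0.5):
    """One-sided probability of observing >= `correct` successes
    out of `trials` if listeners were guessing at `p_chance`."""
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
        for k in range(correct, trials + 1)
    )

# Hypothetical numbers for illustration only: 70 correct picks
# out of 100 two-alternative trials.
p = binomial_p_above_chance(70, 100)
```

A small p here would indicate discrimination reliably above the 50% chance level; comparing p across context conditions mirrors the paper's comparison of lexical vs. non-lexical context.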
Related papers
- What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels [29.532302985753102]
Prosody conveys critical information often not captured by the words or text of a message.
We propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text.
arXiv Detail & Related papers (2025-12-18T18:10:20Z) - Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z) - Disentangling segmental and prosodic factors to non-native speech comprehensibility [11.098498920630782]
Current accent conversion systems do not disentangle the two main sources of non-native accent: segmental and prosodic characteristics.
We present an AC system that not only decouples voice quality from accent, but also disentangles the latter into its segmental and prosodic characteristics.
We conduct perceptual listening tests to quantify the individual contributions of segmental features and prosody on the perceived comprehensibility of non-native speech.
arXiv Detail & Related papers (2024-08-20T16:43:55Z) - Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z) - Quantifying the perceptual value of lexical and non-lexical channels in
speech [10.288091965093816]
This paper introduces a generalised paradigm to study the value of non-lexical information in dialogue across unconstrained lexical content.
We show that non-lexical information produces a consistent effect on expectations of upcoming dialogue.
arXiv Detail & Related papers (2023-07-07T11:44:23Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Disentangling the Impacts of Language and Channel Variability on Speech
Separation Networks [25.662237869109433]
Domain mismatch between training/test situations due to factors, such as speaker, content, channel, and environment, remains a severe problem for speech separation.
In this study, we create several datasets for various experiments. The results show that the impacts of different languages are small enough to be ignored compared to the impacts of different channels.
arXiv Detail & Related papers (2022-03-30T04:07:23Z) - E-ffective: A Visual Analytic System for Exploring the Emotion and
Effectiveness of Inspirational Speeches [57.279044079196105]
E-ffective is a visual analytic system allowing speaking experts and novices to analyze both the role of speech factors and their contribution in effective speeches.
Two novel visualizations include E-spiral (that shows the emotional shifts in speeches in a visually compact way) and E-script (that connects speech content with key speech delivery information).
arXiv Detail & Related papers (2021-10-28T06:14:27Z) - Streaming Multi-talker Speech Recognition with Joint Speaker
Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on the LibriSpeechMix dataset -- a multi-talker dataset derived from LibriSpeech -- and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z) - Fairness in Rating Prediction by Awareness of Verbal and Gesture Quality
of Public Speeches [5.729787815551408]
We formalize a novel HEterogeneity Metric, HEM, that quantifies the quality of a talk both in the verbal and non-verbal domain.
We show that there is an interesting relationship between HEM and the ratings of TED talks given to speakers by viewers.
We incorporate the HEM metric into the loss function of a neural network with the goal to reduce unfairness in rating predictions with respect to race and gender.
arXiv Detail & Related papers (2020-12-11T06:36:55Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and the human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.