Quantifying the perceptual value of lexical and non-lexical channels in speech
- URL: http://arxiv.org/abs/2307.03534v1
- Date: Fri, 7 Jul 2023 11:44:23 GMT
- Title: Quantifying the perceptual value of lexical and non-lexical channels in speech
- Authors: Sarenne Wallbridge, Peter Bell, Catherine Lai
- Abstract summary: This paper introduces a generalised paradigm to study the value of non-lexical information in dialogue across unconstrained lexical content.
We show that non-lexical information produces a consistent effect on expectations of upcoming dialogue.
- Score: 10.288091965093816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech is a fundamental means of communication that can be seen to provide
two channels for transmitting information: the lexical channel of which words
are said, and the non-lexical channel of how they are spoken. Both channels
shape listener expectations of upcoming communication; however, directly
quantifying their relative effect on expectations is challenging. Previous
attempts require spoken variations of lexically-equivalent dialogue turns or
conspicuous acoustic manipulations. This paper introduces a generalised
paradigm to study the value of non-lexical information in dialogue across
unconstrained lexical content. By quantifying the perceptual value of the
non-lexical channel with both accuracy and entropy reduction, we show that
non-lexical information produces a consistent effect on expectations of
upcoming dialogue: even when it leads to poorer discriminative turn judgements
than lexical content alone, it yields higher consensus among participants.
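The two quantities the abstract names, discriminative accuracy and entropy reduction over participant responses, can be sketched as follows. This is a hypothetical illustration, not the paper's code: the toy judgement data, variable names, and the choice of a uniform prior over candidate turns are all assumptions.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of an empirical count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Hypothetical trials: each lists the index of the true dialogue turn
# among n_options candidates, and which candidate each listener picked.
trials = [
    {"true": 0, "picks": [0, 0, 1, 0, 2]},
    {"true": 2, "picks": [2, 2, 2, 3, 2]},
]
n_options = 4

# Accuracy: fraction of individual judgements that select the true turn.
accuracy = (
    sum(p == t["true"] for t in trials for p in t["picks"])
    / sum(len(t["picks"]) for t in trials)
)

# Entropy reduction: uniform prior entropy over the candidates minus the
# entropy of the empirical response distribution, averaged over trials.
# Higher values mean stronger consensus among participants, even when
# that consensus is not on the correct answer.
prior = math.log2(n_options)
reduction = sum(
    prior - entropy(list(Counter(t["picks"]).values())) for t in trials
) / len(trials)
```

Under this sketch, the abstract's key finding corresponds to a condition where `accuracy` drops relative to a lexical-only baseline while `reduction` rises.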
Related papers
- SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models [76.07833875692722]
Speech-based Intelligence Quotient (SIQ) is a new form of human cognition-inspired evaluation pipeline for voice understanding large language models. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks.
arXiv Detail & Related papers (2025-07-25T15:12:06Z)
- Disentangling segmental and prosodic factors to non-native speech comprehensibility [11.098498920630782]
Current accent conversion systems do not disentangle the two main sources of non-native accent: segmental and prosodic characteristics.
We present an AC system that not only decouples voice quality from accent, but also disentangles the latter into its segmental and prosodic characteristics.
We conduct perceptual listening tests to quantify the individual contributions of segmental features and prosody on the perceived comprehensibility of non-native speech.
arXiv Detail & Related papers (2024-08-20T16:43:55Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Cognitive Semantic Communication Systems Driven by Knowledge Graph: Principle, Implementation, and Performance Evaluation [74.38561925376996]
Two cognitive semantic communication frameworks are proposed for the single-user and multiple-user communication scenarios.
An effective semantic correction algorithm is proposed by mining the inference rule from the knowledge graph.
For the multi-user cognitive semantic communication system, a message recovery algorithm is proposed to distinguish messages of different users.
arXiv Detail & Related papers (2023-03-15T12:01:43Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data [0.0]
We introduce, to our knowledge, the most challenging objective in unsupervised lexical learning: an unsupervised network that must learn to assign unique representations for lexical items.
Strong evidence in favor of lexical learning emerges.
The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data in an unsupervised manner without ever accessing real training data.
arXiv Detail & Related papers (2022-03-22T06:04:34Z)
- It's not what you said, it's how you said it: discriminative perception of speech as a multichannel communication system [13.150821247850876]
People convey information extremely effectively through spoken interaction using the lexical channel of what is said, and the non-lexical channel of how it is said.
We propose studying human perception of spoken communication as a means to better understand how information is encoded across these channels.
We present a novel behavioural task testing whether listeners can discriminate between the true utterance in a dialogue and utterances sampled from other contexts with the same lexical content.
arXiv Detail & Related papers (2021-05-01T14:30:30Z)
- Fairness in Rating Prediction by Awareness of Verbal and Gesture Quality of Public Speeches [5.729787815551408]
We formalize a novel HEterogeneity Metric, HEM, that quantifies the quality of a talk both in the verbal and non-verbal domain.
We show that there is an interesting relationship between HEM and the ratings of TED talks given to speakers by viewers.
We incorporate the HEM metric into the loss function of a neural network with the goal to reduce unfairness in rating predictions with respect to race and gender.
arXiv Detail & Related papers (2020-12-11T06:36:55Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.