Representation of perceived prosodic similarity of conversational feedback
- URL: http://arxiv.org/abs/2505.13268v1
- Date: Mon, 19 May 2025 15:47:51 GMT
- Title: Representation of perceived prosodic similarity of conversational feedback
- Authors: Livia Qian, Carol Figueroa, Gabriel Skantze
- Abstract summary: Spectral and self-supervised speech representations encode prosody better than extracted pitch features. It is possible to further condense and align the representations to human perception through contrastive learning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vocal feedback (e.g., `mhm', `yeah', `okay') is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations to human perception through contrastive learning.
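The contrastive alignment step described in the abstract can be illustrated with a small sketch: each triadic comparison yields an anchor clip, the clip judged more similar to it (positive), and the clip judged less similar (negative), which maps naturally onto a triplet margin loss over cosine similarities. This is a generic, dependency-free illustration of that technique, not the paper's actual objective; the margin value and the use of cosine similarity are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on one triadic judgment: push the clip
    participants judged more similar (positive) closer to the anchor
    than the less similar clip (negative), by at least `margin`."""
    return max(0.0, cosine(anchor, negative) - cosine(anchor, positive) + margin)
```

In training, such a loss would be minimized over a projection of the spectral or self-supervised features, so that distances in the projected space track the human similarity judgments.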
Related papers
- Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis [20.80178325643714]
In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings.
We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm.
To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns.
arXiv Detail & Related papers (2025-07-02T22:16:42Z) - Learning Speaker-Invariant Visual Features for Lipreading [54.670614643480505]
Lipreading is a challenging cross-modal task that aims to convert visual lip movements into spoken text.
Existing lipreading methods often extract speaker-specific lip attributes that introduce spurious correlations between vision and text.
We introduce SIFLip, a speaker-invariant visual feature learning framework that disentangles speaker-specific attributes.
arXiv Detail & Related papers (2025-06-09T09:16:14Z) - Pairwise Evaluation of Accent Similarity in Speech Synthesis [11.513055793492418]
We aim to enhance both subjective and objective evaluation methods for accent similarity.
We refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs.
We utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation.
arXiv Detail & Related papers (2025-05-20T14:23:50Z) - ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification [48.98768967435808]
Speaker verification uses computational methods to verify whether an utterance matches the identity of an enrolled speaker.
Despite much success, we have yet to develop a speaker verification system that offers explainable results.
This paper proposes the Explainable Phonetic Trait-Oriented (ExPO) network, a novel approach that introduces the speaker's phonetic traits.
arXiv Detail & Related papers (2025-01-10T05:53:37Z) - Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation [4.216085185442862]
In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors.
How can we learn meaningful gesture representations that account for gestures' variability and their relationship with speech?
This paper employs self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information.
arXiv Detail & Related papers (2024-08-31T08:53:18Z) - The Curious Case of Representational Alignment: Unravelling Visio-Linguistic Tasks in Emergent Communication [1.3499500088995464]
We assess the representational alignment between agents' image representations, as well as between agent representations and the input images.
We identify a strong relationship between inter-agent alignment and topographic similarity, a common metric for compositionality.
Our findings emphasise the key role representational alignment plays in simulations of language emergence.
arXiv Detail & Related papers (2024-07-25T11:29:27Z) - Joint Learning of Context and Feedback Embeddings in Spoken Dialogue [3.8673630752805446]
We investigate the possibility of embedding short dialogue contexts and feedback responses in the same representation space using a contrastive learning objective.
Our results show that the model outperforms humans given the same ranking task and that the learned embeddings carry information about the conversational function of feedback responses.
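Embedding two item types (here, dialogue contexts and feedback responses) in a shared space with a contrastive objective is commonly done with a symmetric InfoNCE loss over a batch of matched pairs. The sketch below is a generic, dependency-free illustration of that family of objectives, not the cited paper's actual model; the temperature value and the use of cosine similarity are assumptions.

```python
import math

def info_nce(contexts, feedbacks, temperature=0.07):
    """Symmetric InfoNCE over a batch: matching (context, feedback)
    pairs sit on the diagonal of the similarity matrix, and each row
    (and column) is treated as a softmax classification problem."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    c = [norm(v) for v in contexts]
    f = [norm(v) for v in feedbacks]
    # cosine similarity matrix scaled by temperature
    logits = [[sum(x * y for x, y in zip(ci, fj)) / temperature for fj in f]
              for ci in c]

    def xent(rows):
        # cross-entropy with the diagonal entry as the target class
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            logz = m + math.log(sum(math.exp(x - m) for x in row))
            total += logz - row[i]
        return total / len(rows)

    cols = [list(col) for col in zip(*logits)]
    return 0.5 * (xent(logits) + xent(cols))
```

With such an objective, the loss is low when each context embedding is closest to its own feedback response and high when the pairing is scrambled, which is what lets nearest-neighbour ranking in the shared space recover conversational function.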
arXiv Detail & Related papers (2024-06-11T14:22:37Z) - Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding [112.0878081944858]
Quantifying the degree of similarity between images is a key copyright issue for image-based machine learning.
We seek to define and compute a notion of "conceptual similarity" among images that captures high-level relations.
Two highly dissimilar images can be discriminated early in their description, whereas conceptually similar ones need more detail to be distinguished.
arXiv Detail & Related papers (2024-02-14T03:31:17Z) - Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis [16.497022070614236]
This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker.
A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm.
arXiv Detail & Related papers (2024-02-11T02:26:43Z) - Accounting for Agreement Phenomena in Sentence Comprehension with Transformer Language Models: Effects of Similarity-based Interference on Surprisal and Attention [4.103438743479001]
We advance an explanation of similarity-based interference effects in subject-verb and reflexive pronoun agreement processing.
We show that surprisal of the verb or reflexive pronoun predicts facilitatory interference effects in ungrammatical sentences.
arXiv Detail & Related papers (2021-04-26T20:46:54Z) - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we reach a rate of 365 bits per second while providing better speech quality than the baseline methods.
arXiv Detail & Related papers (2021-04-01T09:20:33Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z) - Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.