Non-verbal information in spontaneous speech -- towards a new framework of analysis
- URL: http://arxiv.org/abs/2403.03522v2
- Date: Wed, 13 Mar 2024 09:50:40 GMT
- Title: Non-verbal information in spontaneous speech -- towards a new framework of analysis
- Authors: Tirza Biron, Moshe Barboy, Eran Ben-Artzy, Alona Golubchik, Yanir
Marmor, Smadar Szekely, Yaron Winter, David Harel
- Abstract summary: This paper offers an analytical schema and a technological proof-of-concept for the categorization of prosodic signals.
We present a classification process that disentangles prosodic phenomena of three orders.
Disentangling prosodic patterns can direct a theory of communication and speech organization.
- Score: 0.5559722082623594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-verbal signals in speech are encoded by prosody and carry information
that ranges from conversation action to attitude and emotion. Despite its
importance, the principles that govern prosodic structure are not yet
adequately understood. This paper offers an analytical schema and a
technological proof-of-concept for the categorization of prosodic signals and
their association with meaning. The schema interprets surface-representations
of multi-layered prosodic events. As a first step towards implementation, we
present a classification process that disentangles prosodic phenomena of three
orders. It relies on fine-tuning a pre-trained speech recognition model,
enabling simultaneous multi-class/multi-label detection. It generalizes
over a large variety of spontaneous data, performing on a par with, or superior
to, human annotation. In addition to a standardized formalization of prosody,
disentangling prosodic patterns can direct a theory of communication and speech
organization. A welcome by-product is an interpretation of prosody that will
enhance speech- and language-related technologies.
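The abstract's "simultaneous multi-class/multi-label detection" means each prosodic event class is decided independently, so one stretch of speech can carry several events at once, unlike a softmax classifier that picks exactly one. A minimal sketch of that decision step, assuming sigmoid logits from some fine-tuned speech encoder's classification head (the label names and threshold are illustrative stand-ins, not taken from the paper):

```python
import numpy as np

# Hypothetical prosodic event labels; the paper's actual taxonomy of
# "three orders" of prosodic phenomena is not spelled out in the abstract.
LABELS = ["boundary_tone", "prominence", "phrase_break"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detect_events(logits, threshold=0.5):
    """Multi-label detection: each class is thresholded independently,
    so zero, one, or several events can fire for the same input."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    return [LABELS[i] for i, p in enumerate(probs) if p >= threshold]

# Logits as they might come from a fine-tuned speech model's head:
# a strong boundary-tone cue, no prominence, a weak phrase-break cue.
print(detect_events([2.0, -1.5, 0.3]))
```

The contrast with multi-class softmax is the point: independent sigmoids let overlapping prosodic events (e.g. a boundary tone coinciding with a phrase break) be reported together rather than competing for a single slot.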
Related papers
- Identifying and interpreting non-aligned human conceptual
representations using language modeling [0.0]
We show that congenital blindness induces conceptual reorganization in both a-modal and sensory-related verbal domains.
We find that blind individuals more strongly associate social and cognitive meanings to verbs related to motion.
For some verbs, representations of blind and sighted are highly similar.
arXiv Detail & Related papers (2024-03-10T13:02:27Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis [53.511443791260206]
We propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels.
In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behavior labels in speech.
arXiv Detail & Related papers (2023-08-31T09:50:33Z)
- Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics.
We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context.
Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z)
- Disentangling Prosody Representations with Unsupervised Speech Reconstruction [22.873286925385543]
The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction.
Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec.
We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks.
arXiv Detail & Related papers (2022-12-14T01:37:35Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings [4.582129557845177]
This study tackles the unsupervised learning of semantic representations for spoken utterances.
We propose WavEmbed, a sequential autoencoder that predicts hidden units from a dense representation of speech.
We also propose S-HuBERT to induce meaning through knowledge distillation.
arXiv Detail & Related papers (2022-10-23T21:16:09Z)
- Saliency Map Verbalization: Comparing Feature Importance Representations from Model-free and Instruction-based Methods [6.018950511093273]
Saliency maps can explain a neural model's predictions by identifying important input features.
We formalize the underexplored task of translating saliency maps into natural language.
We compare two novel methods (search-based and instruction-based verbalizations) against conventional feature importance representations.
arXiv Detail & Related papers (2022-10-13T17:48:15Z)
- Latent Topology Induction for Understanding Contextualized Representations [84.7918739062235]
We study the representation space of contextualized embeddings and gain insight into the hidden topology of large language models.
We show there exists a network of latent states that summarize linguistic properties of contextualized representations.
arXiv Detail & Related papers (2022-06-03T11:22:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.