Unsupervised Auditory and Semantic Entrainment Models with Deep Neural Networks
- URL: http://arxiv.org/abs/2312.15098v1
- Date: Fri, 22 Dec 2023 22:33:54 GMT
- Title: Unsupervised Auditory and Semantic Entrainment Models with Deep Neural Networks
- Authors: Jay Kejriwal, Stefan Benus, Lina M. Rojas-Barahona
- Abstract summary: We present an unsupervised deep learning framework that derives meaningful representations from textual features for modeling semantic entrainment.
The results show that semantic entrainment can be assessed with our model, that models can distinguish between HH and HM interactions, and that the two units of analysis for extracting acoustic features provide comparable findings.
- Score: 0.3222802562733786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speakers tend to engage in adaptive behavior, known as entrainment, when they
become similar to their interlocutor in various aspects of speaking. We present
an unsupervised deep learning framework that derives meaningful representations
from textual features for modeling semantic entrainment. We investigate the
model's performance by extracting features using different variations of the
BERT model (DistilBERT and XLM-RoBERTa) and Google's universal sentence encoder
(USE) embeddings on two human-human (HH) corpora (The Fisher Corpus English
Part 1, Columbia games corpus) and one human-machine (HM) corpus (Voice
Assistant Conversation Corpus (VACC)). In addition to semantic features, we also
trained DNN-based models utilizing two auditory embeddings (TRIpLet Loss
network (TRILL) vectors and low-level descriptor (LLD) features) and two units of
analysis (inter-pausal unit and turn). The results show that semantic
entrainment can be assessed with our model, that models can distinguish between
HH and HM interactions, and that the two units of analysis for extracting
acoustic features provide comparable findings.
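As a rough illustration of the semantic feature pipeline described above, the sketch below mean-pools DistilBERT embeddings for two dialogue turns and scores their cosine similarity. This is a minimal sketch, not the paper's model: the unsupervised DNN, the USE and XLM-RoBERTa variants, and the training objective are omitted, and the mean-pooling and cosine-similarity choices are our own assumptions.

```python
# Minimal sketch: turn-level semantic similarity from DistilBERT features.
# Assumes the Hugging Face transformers library; the pooling and similarity
# measure are illustrative choices, not the paper's actual model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

def embed_turn(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states over non-padding tokens."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (1, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float() # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def semantic_similarity(turn_a: str, turn_b: str) -> float:
    """Cosine similarity between two turns' embeddings, a simple proxy
    for how semantically close a reply is to the preceding turn."""
    return torch.nn.functional.cosine_similarity(
        embed_turn(turn_a), embed_turn(turn_b)).item()

print(semantic_similarity("Shall we meet at noon?", "Noon works for me."))
```

In the paper's setup, such per-turn features feed a trained DNN rather than being compared directly, but the raw similarity already gives a usable first approximation of semantic entrainment.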
Related papers
- Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0 [0.11510009152620666]
We study how Wav2Vec2 resolves phonotactic constraints.
We synthesize sounds on an acoustic continuum between /l/ and /r/ and embed them in controlled contexts.
Like humans, Wav2Vec2 models show a bias towards the phonotactically admissible category when processing such ambiguous sounds.
arXiv Detail & Related papers (2024-07-03T11:04:31Z)
- Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection [9.788417605537965]
We introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement.
Our proposed method achieves state-of-the-art results in open vocabulary HOI detection.
arXiv Detail & Related papers (2024-04-09T10:27:22Z)
- Relationship between auditory and semantic entrainment using Deep Neural Networks (DNN) [0.0]
This study utilized state-of-the-art embeddings such as BERT and TRIpLet Loss network (TRILL) vectors to extract features for measuring the semantic and auditory similarity of turns within dialogues; a minimal TRILL-based sketch appears after this list.
We found that people tend to entrain more on semantic features than on auditory features.
The findings of this study might assist in implementing the mechanism of entrainment in human-machine interaction (HMI).
arXiv Detail & Related papers (2023-12-27T14:50:09Z)
- CiwaGAN: Articulatory information exchange [15.944474482218334]
Humans encode information into sounds by controlling articulators and decode information from sounds using the auditory apparatus.
This paper introduces CiwaGAN, a model of human spoken language acquisition that combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality.
arXiv Detail & Related papers (2023-09-14T17:10:39Z)
- Agentività e telicità in GilBERTo: implicazioni cognitive (Agentivity and telicity in GilBERTo: cognitive implications) [77.71680953280436]
The goal of this study is to investigate whether a Transformer-based neural language model infers lexical semantics.
The semantic properties considered are telicity (also combined with definiteness) and agentivity.
arXiv Detail & Related papers (2023-07-06T10:52:22Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Learning Decoupling Features Through Orthogonality Regularization [55.79910376189138]
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (a KWS branch and an SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously.
arXiv Detail & Related papers (2022-03-31T03:18:13Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings [50.524054820564395]
We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.
The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.
arXiv Detail & Related papers (2020-12-03T19:24:42Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
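To complement the semantic sketch above, here is a rough sketch of the auditory side used by the entrainment papers in this list: extracting TRILL embeddings via TensorFlow Hub and averaging the frame-level vectors into one vector per turn. The TFHub handle and call signature follow the published TRILL module but should be verified against its current model card; the frame pooling and cosine-similarity comparison are our own illustrative assumptions.

```python
# Minimal sketch: per-turn TRILL auditory embeddings plus cosine similarity.
# Assumes tensorflow and tensorflow_hub are installed; the module handle
# below is the published TRILL v3 release on TensorFlow Hub.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

trill = hub.load("https://tfhub.dev/google/nonsemantic-speech-benchmark/trill/3")

def trill_embedding(waveform: np.ndarray) -> np.ndarray:
    """Average TRILL's frame-level embeddings into one per-turn vector.
    `waveform` is 1-D float mono audio at 16 kHz, scaled to [-1, 1]."""
    out = trill(samples=tf.constant(waveform, tf.float32), sample_rate=16000)
    return out["embedding"].numpy().mean(axis=0)

def auditory_similarity(turn_a: np.ndarray, turn_b: np.ndarray) -> float:
    """Cosine similarity between two turns' averaged TRILL vectors."""
    a, b = trill_embedding(turn_a), trill_embedding(turn_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic one-second example; real use would load per-turn or per-IPU
# audio segments from a corpus such as Fisher or VACC.
fake_a = np.random.uniform(-1, 1, 16000).astype(np.float32)
fake_b = np.random.uniform(-1, 1, 16000).astype(np.float32)
print(auditory_similarity(fake_a, fake_b))
```

In the papers above, such per-IPU or per-turn auditory vectors are fed to an unsupervised DNN rather than compared directly; the direct similarity here only illustrates the feature-extraction step.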
This list is automatically generated from the titles and abstracts of the papers on this site.