Modeling Speaker-Listener Interaction for Backchannel Prediction
- URL: http://arxiv.org/abs/2304.04472v1
- Date: Mon, 10 Apr 2023 09:22:06 GMT
- Title: Modeling Speaker-Listener Interaction for Backchannel Prediction
- Authors: Daniel Ortega, Sarina Meyer, Antje Schweitzer and Ngoc Thang Vu
- Abstract summary: Backchanneling theories emphasize the active and continuous role of the listener in the course of a conversation.
We propose a neural-based acoustic backchannel classifier on minimal responses by processing acoustic features from the speaker's speech.
Our experimental results on the Switchboard and GECO datasets reveal that in almost all tested scenarios the speaker or listener behavior embeddings help the model make more accurate backchannel predictions.
- Score: 24.52345279975304
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present our latest findings on backchannel modeling, newly motivated by
the canonical use of the minimal responses Yeah and Uh-huh in English and their
corresponding tokens in German, and the effect of encoding the speaker-listener
interaction. Backchanneling theories emphasize the active and continuous role
of the listener in the course of the conversation, their effects on the
speaker's subsequent talk, and the consequent dynamic speaker-listener
interaction. Therefore, we propose a neural-based acoustic backchannel
classifier on minimal responses by processing acoustic features from the
speaker's speech, capturing and imitating listeners' backchanneling behavior, and
encoding speaker-listener interaction. Our experimental results on the
Switchboard and GECO datasets reveal that in almost all tested scenarios the
speaker or listener behavior embeddings help the model make more accurate
backchannel predictions. More importantly, a proper interaction encoding
strategy, i.e., combining the speaker and listener embeddings, leads to the
best performance on both datasets in terms of F1-score.
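The abstract does not spell out the architecture, so the following is only a minimal sketch of the idea it describes: an acoustic encoder over the speaker's speech combined with learned speaker and listener behavior embeddings, where concatenating the two embeddings stands in for the "interaction encoding" strategy. All layer sizes, feature dimensions, and names below are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch (not the authors' code): a backchannel classifier that
# concatenates speaker and listener behavior embeddings with a pooled acoustic
# representation of the speaker's speech. Dimensions are arbitrary assumptions.
import torch
import torch.nn as nn


class BackchannelClassifier(nn.Module):
    def __init__(self, n_speakers, n_listeners, acoustic_dim=40,
                 embed_dim=32, hidden_dim=128, n_classes=3):
        super().__init__()
        # Behavior embeddings: one learned vector per speaker / listener identity.
        self.speaker_emb = nn.Embedding(n_speakers, embed_dim)
        self.listener_emb = nn.Embedding(n_listeners, embed_dim)
        # Encode a window of frame-level acoustic features (e.g. MFCCs).
        self.acoustic_enc = nn.LSTM(acoustic_dim, hidden_dim, batch_first=True)
        # Predict backchannel vs. no backchannel (or a backchannel type).
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim + 2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, acoustic_frames, speaker_id, listener_id):
        # acoustic_frames: (batch, time, acoustic_dim)
        _, (h_n, _) = self.acoustic_enc(acoustic_frames)
        speech_repr = h_n[-1]  # (batch, hidden_dim), last-layer final state
        # "Interaction encoding" here is a simple concatenation of the two embeddings.
        interaction = torch.cat(
            [self.speaker_emb(speaker_id), self.listener_emb(listener_id)], dim=-1)
        return self.classifier(torch.cat([speech_repr, interaction], dim=-1))


# Toy usage: 8 speech windows of 100 frames with 40-dim acoustic features.
model = BackchannelClassifier(n_speakers=500, n_listeners=500)
feats = torch.randn(8, 100, 40)
logits = model(feats, torch.randint(0, 500, (8,)), torch.randint(0, 500, (8,)))
print(logits.shape)  # torch.Size([8, 3])
```

Concatenation is only one plausible way to combine the speaker and listener embeddings; the abstract states that combining them is what yields the best F1-score, without fixing a particular fusion operator.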
Related papers
- Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection [24.71649541757314]
Short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue.
This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection model.
arXiv Detail & Related papers (2024-10-21T11:57:56Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Oh, Jeez! or Uh-huh? A Listener-aware Backchannel Predictor on ASR Transcriptions [30.779582465296897]
We develop a system which acts as a proactive listener by inserting backchannels, such as continuers and assessments, to influence speakers.
Our model takes into account not only lexical and acoustic cues, but also introduces the simple and novel idea of using listener embeddings to mimic different backchanneling behaviours.
arXiv Detail & Related papers (2023-04-10T09:33:29Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines on four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z)
- A Speaker-aware Parallel Hierarchical Attentive Encoder-Decoder Model for Multi-turn Dialogue Generation [13.820298189734686]
This paper presents a novel open-domain dialogue generation model emphasizing the differentiation of speakers in multi-turn conversations.
Our empirical results show that PHAED outperforms the state-of-the-art in both automatic and human evaluations.
arXiv Detail & Related papers (2021-10-13T16:08:29Z)
- The Right to Talk: An Audio-Visual Transformer Approach [27.71444773878775]
This work introduces a new Audio-Visual Transformer approach to the problem of localizing and highlighting the main speaker in both the audio and visual channels of a multi-speaker conversation video in the wild.
To the best of our knowledge, it is one of the first studies that is able to automatically localize and highlight the main speaker in both visual and audio channels in multi-speaker conversation videos.
arXiv Detail & Related papers (2021-08-06T18:04:24Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Self-supervised learning for audio-visual speaker diarization [33.87232473483064]
We propose a self-supervised audio-video synchronization learning method to address the problem of speaker diarization without massive labeling effort.
We test them on a real-world human-computer interaction system, and the results show our best model yields a remarkable gain of +8% F1-score as well as a reduction in diarization error rate.
arXiv Detail & Related papers (2020-02-13T02:36:32Z)