Gated Multimodal Fusion with Contrastive Learning for Turn-taking
Prediction in Human-robot Dialogue
- URL: http://arxiv.org/abs/2204.10172v1
- Date: Mon, 18 Apr 2022 05:18:00 GMT
- Title: Gated Multimodal Fusion with Contrastive Learning for Turn-taking
Prediction in Human-robot Dialogue
- Authors: Jiudong Yang, Peiying Wang, Yi Zhu, Mingchao Feng, Meng Chen, Xiaodong
He
- Abstract summary: Turn-taking, aiming to decide when the next speaker can start talking, is an essential component in building human-robot spoken dialogue systems.
We first collect a large-scale annotated corpus for turn-taking with over 5,000 real human-robot dialogues in speech and text modalities.
A novel gated multimodal fusion mechanism is devised to utilize various information seamlessly for turn-taking prediction.
- Score: 15.710861456924158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Turn-taking, aiming to decide when the next speaker can start talking, is an
essential component in building human-robot spoken dialogue systems. Previous
studies indicate that multimodal cues can facilitate this challenging task.
However, due to the paucity of public multimodal datasets, current methods are
mostly limited to either unimodal features or simplistic multimodal ensemble
models. Besides, the inherent class imbalance in real scenarios, e.g. a
sentence ending with a short pause is mostly regarded as the end of a turn,
also poses a great challenge to the turn-taking decision. In this paper, we first
collect a large-scale annotated corpus for turn-taking with over 5,000 real
human-robot dialogues in speech and text modalities. Then, a novel gated
multimodal fusion mechanism is devised to utilize various information
seamlessly for turn-taking prediction. More importantly, to tackle the data
imbalance issue, we design a simple yet effective data augmentation method to
construct negative instances without supervision and apply contrastive learning
to obtain better feature representations. Extensive experiments are conducted
and the results demonstrate the superiority and competitiveness of our model
over several state-of-the-art baselines.
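The abstract names two mechanisms without giving their exact form: a gated fusion that blends text and speech features, and a contrastive loss over augmented negative instances. The NumPy sketch below illustrates one common realization of each idea — a sigmoid gate that interpolates between modality embeddings, and an InfoNCE-style loss that pulls an anchor toward a positive view and away from augmented negatives. All dimensions, the weight matrix `W_g`, and the negative-sampling scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical setup: text and audio encoders both emit 8-dim embeddings.
d = 8
h_text = rng.normal(size=d)   # utterance embedding from a text encoder
h_audio = rng.normal(size=d)  # embedding from an acoustic encoder

# Gated fusion: a learned gate decides, per dimension, how much weight to
# give each modality. W_g stands in for a trained projection matrix.
W_g = rng.normal(size=(d, 2 * d))
g = sigmoid(W_g @ np.concatenate([h_text, h_audio]))
h_fused = g * h_text + (1.0 - g) * h_audio

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: softmax over cosine similarities, positive at index 0."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)]
                      + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0])

# Augmented negatives (e.g. embeddings of truncated, not-yet-finished turns
# in the paper's setting); here they are random placeholders.
negatives = rng.normal(size=(4, d))
positive = h_fused + 0.01 * rng.normal(size=d)  # a second view of the anchor
loss = info_nce(h_fused, positive, negatives)

assert h_fused.shape == (d,)
assert np.all((g > 0) & (g < 1))  # gate values lie strictly in (0, 1)
assert loss > 0
```

Minimizing such a loss pushes representations of finished turns apart from the augmented "unfinished" negatives, which is the mechanism the abstract credits for handling class imbalance.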
Related papers
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Pre-training Multi-party Dialogue Models with Latent Discourse Inference [85.9683181507206]
We pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying.
To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model.
arXiv Detail & Related papers (2023-05-24T14:06:27Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Self- and Pseudo-self-supervised Prediction of Speaker and Key-utterance for Multi-party Dialogue Reading Comprehension [46.69961067676279]
Multi-party dialogue machine reading comprehension (MRC) poses a tremendous challenge since it involves multiple speakers in one dialogue.
Previous models focus on how to incorporate speaker information flows using complex graph-based modules.
In this paper, we design two labour-free self- and pseudo-self-supervised prediction tasks on speaker and key-utterance to implicitly model the speaker information flows.
arXiv Detail & Related papers (2021-09-08T16:51:41Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
- Multi-Modal Open-Domain Dialogue [28.69395893943413]
Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling.
We investigate combining components from state-of-the-art open-domain dialogue agents with those from state-of-the-art vision models.
We show that our best resulting model outperforms strong existing models in multi-modal dialogue while simultaneously performing as well as its predecessor.
arXiv Detail & Related papers (2020-10-02T16:20:39Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In the existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) as encoder represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.