Gated Multimodal Fusion with Contrastive Learning for Turn-taking
Prediction in Human-robot Dialogue
- URL: http://arxiv.org/abs/2204.10172v1
- Date: Mon, 18 Apr 2022 05:18:00 GMT
- Title: Gated Multimodal Fusion with Contrastive Learning for Turn-taking
Prediction in Human-robot Dialogue
- Authors: Jiudong Yang, Peiying Wang, Yi Zhu, Mingchao Feng, Meng Chen, Xiaodong
He
- Abstract summary: Turn-taking, aiming to decide when the next speaker can start talking, is an essential component in building human-robot spoken dialogue systems.
We first collect a large-scale annotated corpus for turn-taking with over 5,000 real human-robot dialogues in speech and text modalities.
A novel gated multimodal fusion mechanism is devised to utilize various information seamlessly for turn-taking prediction.
- Score: 15.710861456924158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Turn-taking, aiming to decide when the next speaker can start talking, is an
essential component in building human-robot spoken dialogue systems. Previous
studies indicate that multimodal cues can facilitate this challenging task.
However, due to the paucity of public multimodal datasets, current methods are
mostly limited to either utilizing unimodal features or simplistic multimodal
ensemble models. Besides, the inherent class imbalance in real scenarios, e.g., a
sentence ending with a short pause is usually regarded as the end of a turn,
also poses a great challenge to the turn-taking decision. In this paper, we first
collect a large-scale annotated corpus for turn-taking with over 5,000 real
human-robot dialogues in speech and text modalities. Then, a novel gated
multimodal fusion mechanism is devised to utilize various information
seamlessly for turn-taking prediction. More importantly, to tackle the data
imbalance issue, we design a simple yet effective data augmentation method to
construct negative instances without supervision and apply contrastive learning
to obtain better feature representations. Extensive experiments are conducted
and the results demonstrate the superiority and competitiveness of our model
over several state-of-the-art baselines.
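The abstract does not spell out the fusion architecture, so the following is only a minimal sketch of what a gated fusion layer over speech and text utterance embeddings could look like. All names (GatedFusion, audio_dim, the two-class turn-taking head) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of gated multimodal fusion for turn-taking prediction.
# Dimensions and module names are assumptions; the paper's exact design may differ.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse an acoustic and a textual utterance embedding with a learned gate."""

    def __init__(self, audio_dim: int, text_dim: int, hidden_dim: int):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # The gate decides, per dimension, how much each modality contributes.
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, 2)  # end-of-turn vs. keep waiting

    def forward(self, audio_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.audio_proj(audio_feat))
        t = torch.tanh(self.text_proj(text_feat))
        g = torch.sigmoid(self.gate(torch.cat([a, t], dim=-1)))
        fused = g * a + (1.0 - g) * t   # convex combination controlled by the gate
        return self.classifier(fused)   # logits for the turn-taking decision

# Example: a batch of 8 utterances with 512-dim audio and 768-dim text embeddings.
logits = GatedFusion(512, 768, 256)(torch.randn(8, 512), torch.randn(8, 768))
```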
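Likewise, the abstract only says that negative instances are constructed without supervision and used for contrastive learning. A common way to realize this is an InfoNCE-style objective over anchor/positive/negative embeddings; the sketch below assumes the negative is a corrupted view of the same utterance (e.g., truncated at a short pause), which is an assumption rather than the paper's stated recipe.

```python
# Hypothetical contrastive objective over constructed negatives (not the paper's exact loss).
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor,
                     positive: torch.Tensor,
                     negative: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """Pull each anchor toward its positive and push it away from a constructed
    negative (e.g., the same utterance truncated mid-sentence)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    pos_sim = (a * p).sum(-1, keepdim=True) / temperature   # (B, 1)
    neg_sim = (a * n).sum(-1, keepdim=True) / temperature   # (B, 1)
    logits = torch.cat([pos_sim, neg_sim], dim=-1)           # (B, 2)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)                   # positive goes in slot 0

# Example with 16 utterance representations of dimension 256.
loss = contrastive_loss(torch.randn(16, 256), torch.randn(16, 256), torch.randn(16, 256))
```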
Related papers
- MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games [70.37904949359938]
We evaluate language models in multi-turn interactions using a suite of collaborative games that require effective communication about private information. We find that language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario. We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence.
arXiv Detail & Related papers (2026-02-27T17:13:20Z) - MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation [23.343080324521434]
We introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. Our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources.
arXiv Detail & Related papers (2025-08-26T14:00:16Z) - DiscussLLM: Teaching Large Language Models When to Speak [9.441455921296301]
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text. We introduce DiscussLLM, a framework designed to bridge this gap by training models to proactively decide not just what to say, but critically, when to speak.
arXiv Detail & Related papers (2025-08-25T16:16:42Z) - Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
We propose a novel preference alignment framework to improve spoken dialogue models on real-time conversations from user interactions. We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
arXiv Detail & Related papers (2025-06-26T16:45:20Z) - Quantifying Cross-Modality Memorization in Vision-Language Models [86.82366725590508]
We study the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities.
arXiv Detail & Related papers (2025-06-05T16:10:47Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - Pre-training Multi-party Dialogue Models with Latent Discourse Inference [85.9683181507206]
We pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying.
To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model.
arXiv Detail & Related papers (2023-05-24T14:06:27Z) - Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal
Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - Self- and Pseudo-self-supervised Prediction of Speaker and Key-utterance
for Multi-party Dialogue Reading Comprehension [46.69961067676279]
Multi-party dialogue machine reading comprehension (MRC) poses a tremendous challenge since it involves multiple speakers in one dialogue.
Previous models focus on how to incorporate speaker information flows using complex graph-based modules.
In this paper, we design two labour-free self- and pseudo-self-supervised prediction tasks on speaker and key-utterance to implicitly model the speaker information flows.
arXiv Detail & Related papers (2021-09-08T16:51:41Z) - Multi-View Sequence-to-Sequence Models with Conversational Structure for
Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z) - Multi-Modal Open-Domain Dialogue [28.69395893943413]
Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling.
We investigate combining components from state-of-the-art open-domain dialogue agents with those from state-of-the-art vision models.
We show that our best resulting model outperforms strong existing models in multi-modal dialogue while simultaneously performing as well as its predecessor.
arXiv Detail & Related papers (2020-10-02T16:20:39Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)