Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation
- URL: http://arxiv.org/abs/2406.07867v2
- Date: Fri, 2 Aug 2024 15:05:47 GMT
- Title: Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation
- Authors: Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeong Hun Yeo, Yong Man Ro
- Abstract summary: We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
- Score: 55.043492250775294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 340 hours of approximately 9,000 dialogues, recorded based on the open-domain dialogue dataset TopicalChat. MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it to the audio-visual spoken dialogue domain through speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating a face-to-face conversation. Demo and data are available at https://multidialog.github.io and https://huggingface.co/datasets/IVLLab/MultiDialog, respectively.
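Since the corpus is hosted on the Hugging Face Hub (linked above), a minimal sketch of how one might load and inspect it with the `datasets` library follows; the split names and column names are assumptions rather than documented guarantees, so consult the dataset card for the actual configuration.

```python
# Minimal sketch: browse the MultiDialog corpus via the Hugging Face `datasets`
# library. The repository id comes from the link above; everything else
# (configurations, splits, column names) is an assumption to be checked
# against the dataset card.
from datasets import load_dataset

# Load the dataset; if the repository defines multiple configurations,
# a config name may need to be passed as the second argument.
multidialog = load_dataset("IVLLab/MultiDialog")

print(multidialog)            # shows the available splits and their sizes
first_split = next(iter(multidialog.values()))
print(first_split.features)   # inspect the actual schema (audio, video, text, emotion, ...)

# Decoding audio/video columns may require extra backends (e.g. soundfile).
example = first_split[0]
print(example.keys())
```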
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection, speech recognition, and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
The resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
- Audio Dialogues: Dialogues dataset for audio and music understanding [29.550656226658962]
We introduce Audio Dialogues: a multi-turn dialogue dataset containing 163.8k samples for general audio sounds and music.
In addition to dialogues, Audio Dialogues also has question-answer pairs to understand and compare multiple input audios together.
arXiv Detail & Related papers (2024-04-11T10:08:34Z)
- Multi-turn Dialogue Comprehension from a Topic-aware Perspective [70.37126956655985]
This paper proposes to model multi-turn dialogues from a topic-aware perspective.
We use a dialogue segmentation algorithm to split a dialogue passage into topic-concentrated fragments in an unsupervised way.
We also present a novel model, Topic-Aware Dual-Attention Matching (TADAM) Network, which takes topic segments as processing elements.
arXiv Detail & Related papers (2023-09-18T11:03:55Z)
- ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human [76.62897301298699]
ChatPLUG is a Chinese open-domain dialogue system for digital human applications, instruction-finetuned on a wide range of dialogue tasks in a unified internet-augmented format.
We show that ChatPLUG outperforms state-of-the-art Chinese dialogue systems in both automatic and human evaluation.
We deploy ChatPLUG to real-world applications such as smart speaker and instant messaging applications with fast inference.
arXiv Detail & Related papers (2023-04-16T18:16:35Z)
- TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World [97.58623810402563]
We introduce a new video-based multi-modal dialogue dataset, called TikTalk.
We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them.
Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context.
arXiv Detail & Related papers (2023-01-14T10:18:22Z)
- "How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations [87.95711406978157]
This work presents a new benchmark on spoken task-oriented conversations.
We study multi-domain dialogue state tracking and knowledge-grounded dialogue modeling.
Our data set enables speech-based benchmarking of task-oriented dialogue systems.
arXiv Detail & Related papers (2021-09-28T04:51:04Z)
- DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization [19.918194137007653]
We present a pre-training framework for long dialogue understanding and summarization.
Considering the nature of long conversations, we propose a window-based denoising approach for generative pre-training.
We conduct extensive experiments on five datasets of long dialogues, covering tasks of dialogue summarization, abstractive question answering and topic segmentation.
arXiv Detail & Related papers (2021-09-06T13:55:03Z)
- MMChat: Multi-Modal Chat Dataset on Social Media [8.904627457711683]
MMChat is a large-scale multi-modal dialogue corpus (32.4M raw dialogues and 120.84K filtered dialogues).
Unlike previous corpora that are crowd-sourced or collected from fictitious movies, MMChat contains image-grounded dialogues collected from real conversations on social media.
We develop a benchmark model to address the sparsity of image grounding in dialogue generation tasks by adapting an attention routing mechanism on image features.
arXiv Detail & Related papers (2021-08-16T15:27:49Z)
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-modal dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z)
- Interview: A Large-Scale Open-Source Corpus of Media Dialog [11.28504775964698]
We introduce 'Interview': a large-scale (105K conversations) media dialog dataset collected from news interview transcripts.
Compared to existing large-scale proxies for conversational data, language models trained on our dataset exhibit better zero-shot out-of-domain performance.
'Interview' contains speaker role annotations for each turn, facilitating the development of engaging, responsive dialog systems.
arXiv Detail & Related papers (2020-04-07T02:44:50Z)