Towards a Japanese Full-duplex Spoken Dialogue System
- URL: http://arxiv.org/abs/2506.02979v1
- Date: Tue, 03 Jun 2025 15:16:50 GMT
- Title: Towards a Japanese Full-duplex Spoken Dialogue System
- Authors: Atsumoto Ohashi, Shinya Iizuka, Jingjing Jiang, Ryuichiro Higashinaka
- Abstract summary: Full-duplex spoken dialogue systems have attracted significant attention recently. In this paper, we present the first publicly available full-duplex spoken dialogue model in Japanese. Our model is trained through a two-stage process: pre-training on large-scale Japanese spoken dialogue data, followed by fine-tuning on high-quality stereo spoken dialogue data.
- Score: 8.984488716637655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Full-duplex spoken dialogue systems, which can model simultaneous bidirectional features of human conversations such as speech overlaps and backchannels, have attracted significant attention recently. However, research on full-duplex spoken dialogue systems for Japanese remains scarce. In this paper, we present the first publicly available full-duplex spoken dialogue model in Japanese, built upon Moshi, a full-duplex dialogue model for English. Our model is trained through a two-stage process: pre-training on large-scale Japanese spoken dialogue data, followed by fine-tuning on high-quality stereo spoken dialogue data. We further enhance the model's performance by incorporating synthetic dialogue data generated by a multi-stream text-to-speech system. Evaluation experiments demonstrate that the trained model outperforms Japanese baseline models in both naturalness and meaningfulness.
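To make the two-stage recipe concrete, here is a minimal sketch of pre-training on large-scale dialogue data followed by fine-tuning on high-quality stereo data. Everything in it is hypothetical: FullDuplexLM, ToyDialogueDataset, and all hyperparameters are illustrative stand-ins, not the authors' Moshi-based implementation.

```python
# Minimal sketch of the two-stage training recipe described in the abstract.
# FullDuplexLM, ToyDialogueDataset, and all hyperparameters are hypothetical;
# the real model is built on Moshi and trained on Japanese dialogue audio.
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class ToyDialogueDataset(Dataset):
    """Random token streams standing in for tokenized stereo dialogue audio."""

    def __init__(self, n=32, seq_len=64, vocab=1024):
        self.user = torch.randint(0, vocab, (n, seq_len))
        self.system = torch.randint(0, vocab, (n, seq_len))

    def __len__(self):
        return len(self.user)

    def __getitem__(self, i):
        return self.user[i], self.system[i]


class FullDuplexLM(nn.Module):
    """Toy stand-in for a Moshi-style two-stream dialogue model."""

    def __init__(self, vocab_size=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Both channels (user / system) are modeled jointly; here we simply
        # concatenate their token embeddings along the feature axis.
        self.backbone = nn.GRU(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, user_tokens, system_tokens):
        x = torch.cat([self.embed(user_tokens), self.embed(system_tokens)], dim=-1)
        h, _ = self.backbone(x)
        return self.head(h)  # predict the next system-channel token


def run_stage(model, loader, epochs, lr):
    """One training stage: next-token prediction on the system channel."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for user_tok, sys_tok in loader:
            logits = model(user_tok[:, :-1], sys_tok[:, :-1])
            loss = loss_fn(logits.transpose(1, 2), sys_tok[:, 1:])
            opt.zero_grad()
            loss.backward()
            opt.step()


model = FullDuplexLM()
# Stage 1: pre-train on large-scale Japanese spoken dialogue data.
run_stage(model, DataLoader(ToyDialogueDataset(), batch_size=8), epochs=1, lr=3e-4)
# Stage 2: fine-tune on high-quality stereo dialogue, optionally augmented
# with synthetic dialogues generated by a multi-stream TTS system.
run_stage(model, DataLoader(ToyDialogueDataset(), batch_size=8), epochs=3, lr=1e-4)
```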
Related papers
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z) - Building a Taiwanese Mandarin Spoken Language Model: A First Attempt [44.54200115439157]
This report aims to build a spoken large language model (LLM) for Taiwanese Mandarin, tailored to enable real-time speech interaction in multi-turn conversations.
Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving conversational flow.
arXiv Detail & Related papers (2024-11-11T16:37:40Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency.
Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on Dialogue Generative Spoken Language Model [47.67067056593085]
We develop a pipeline capable of transforming single-channel dialogue data into pseudo-stereo data.
This expanded the training dataset from 2,000 hours to 17,600 hours.
The inclusion of this pseudo-stereo data has proven effective in improving the performance of spoken dialogue language models (see the sketch below).
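As a rough illustration of what a pseudo-stereo conversion can look like, the sketch below routes each speaker's segments of a mono dialogue onto a separate channel. The diarize function is a hypothetical placeholder for a real speaker-diarization model; the paper's actual pipeline may differ.

```python
# Hedged sketch of a pseudo-stereo conversion: route each speaker's segments
# of a mono dialogue onto its own channel. `diarize` is a hypothetical
# placeholder for a real speaker-diarization model.
import numpy as np


def diarize(mono, sr):
    """Hypothetical diarizer returning (start_sec, end_sec, speaker_id) tuples.
    Faked here by alternating two speakers every second."""
    length = len(mono) / sr
    return [(t, min(t + 1.0, length), t % 2) for t in range(int(length))]


def mono_to_pseudo_stereo(mono, sr):
    """Copy each diarized segment onto the channel of its speaker."""
    stereo = np.zeros((2, len(mono)), dtype=mono.dtype)
    for start, end, speaker in diarize(mono, sr):
        a, b = int(start * sr), int(end * sr)
        stereo[speaker, a:b] = mono[a:b]
    return stereo


sr = 16000
mono = np.random.randn(sr * 5).astype(np.float32)  # 5 s of fake audio
stereo = mono_to_pseudo_stereo(mono, sr)
print(stereo.shape)  # (2, 80000)
```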
arXiv Detail & Related papers (2024-07-02T03:22:41Z) - Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z) - Towards Joint Modeling of Dialogue Response and Speech Synthesis based
on Large Language Model [8.180382743037082]
This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously.
arXiv Detail & Related papers (2023-09-20T01:48:27Z) - Back to the Future: Bidirectional Information Decoupling Network for
Multi-turn Dialogue Modeling [80.51094098799736]
We propose Bidirectional Information Decoupling Network (BiDeN) as a universal dialogue encoder.
BiDeN explicitly incorporates both the past and future contexts and can be generalized to a wide range of dialogue-related tasks.
Experimental results on datasets of different downstream tasks demonstrate the universality and effectiveness of our BiDeN (a rough sketch of the decoupling idea follows).
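The following sketch is a loose guess at the decoupling idea: encode the same dialogue under a past-only attention mask and a future-only mask, then fuse the two views. It is not BiDeN's actual architecture; the masks and the fusion step are assumptions made for illustration.

```python
# Illustrative sketch of decoupling past and future dialogue context with
# attention masks; a guess at the general idea, not BiDeN's architecture.
import torch
from torch import nn

T, D = 6, 32                      # 6 utterance positions, feature dim 32
x = torch.randn(1, T, D)          # encoded dialogue history

attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

# Boolean masks: True means "do not attend".
past_only = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
future_only = torch.tril(torch.ones(T, T, dtype=torch.bool), diagonal=-1)

past_view, _ = attn(x, x, x, attn_mask=past_only)      # each turn sees history
future_view, _ = attn(x, x, x, attn_mask=future_only)  # each turn sees what follows
fused = torch.cat([past_view, future_view], dim=-1)    # decoupled, then fused
print(fused.shape)  # torch.Size([1, 6, 64])
```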
arXiv Detail & Related papers (2022-04-18T03:51:46Z) - EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative
Pre-Training [40.85554509137999]
We propose EVA, a Chinese dialogue system that contains the largest Chinese pre-trained dialogue model with 2.8B parameters.
To build this model, we collect the largest Chinese dialogue dataset named WDC-Dialogue from various public social media.
Experiments on automatic and human evaluation show that EVA outperforms other Chinese pre-trained dialogue models.
arXiv Detail & Related papers (2021-08-03T14:55:24Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)