Towards human-like spoken dialogue generation between AI agents from written dialogue
- URL: http://arxiv.org/abs/2310.01088v1
- Date: Mon, 2 Oct 2023 11:03:20 GMT
- Title: Towards human-like spoken dialogue generation between AI agents from written dialogue
- Authors: Kentaro Mitsui, Yukiya Hono, Kei Sawada
- Abstract summary: This study proposes CHATS - CHatty Agents Text-to-Speech - a discrete token-based system designed to generate spoken dialogues based on written dialogues.
Our system can generate speech for both the speaker side and the listener side simultaneously, using only the transcription from the speaker side.
- Score: 8.4989907582951
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of large language models (LLMs) has made it possible to generate
natural written dialogues between two agents. However, generating human-like
spoken dialogues from these written dialogues remains challenging. Spoken
dialogues have several unique characteristics: they frequently include
backchannels and laughter, and the smoothness of turn-taking significantly
influences the fluidity of conversation. This study proposes CHATS - CHatty
Agents Text-to-Speech - a discrete token-based system designed to generate
spoken dialogues based on written dialogues. Our system can generate speech for
both the speaker side and the listener side simultaneously, using only the
transcription from the speaker side, which eliminates the need for
transcriptions of backchannels or laughter. Moreover, CHATS facilitates natural
turn-taking; it determines the appropriate duration of silence after each
utterance in the absence of overlap, and it initiates the generation of
overlapping speech based on the phoneme sequence of the next utterance in case
of overlap. Experimental evaluations indicate that CHATS outperforms the
text-to-speech baseline, producing spoken dialogues that are more interactive
and fluid while retaining clarity and intelligibility.
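As a rough illustration of the turn-taking mechanism described in the abstract, the sketch below lays synthesized turns out on a shared timeline: a non-negative predicted gap inserts silence after an utterance, while a negative gap starts the next utterance before the current one ends (overlap). All names, and the toy gap predictor keyed on turn length, are assumptions for illustration; they are not CHATS internals.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    phonemes: list   # phoneme sequence; in CHATS this drives overlap onset
    duration: float  # synthesized length in seconds

def predict_gap(prev: Turn, nxt: Turn) -> float:
    """Toy stand-in for the learned gap model: a non-negative value is
    silence in seconds; a negative value means `nxt` starts before `prev`
    ends (overlap, e.g. a backchannel)."""
    return -0.2 if len(nxt.phonemes) <= 3 else 0.3

def schedule(turns):
    """Assign each turn a start time on a shared timeline."""
    timeline, t = [], 0.0
    for i, turn in enumerate(turns):
        timeline.append((round(t, 2), turn.speaker))
        if i + 1 < len(turns):
            t += turn.duration + predict_gap(turn, turns[i + 1])
    return timeline

print(schedule([
    Turn("A", list("hello"), 1.2),
    Turn("B", list("mhm"), 0.4),   # short backchannel-like turn -> overlap
    Turn("A", list("okay so"), 0.9),
]))
# [(0.0, 'A'), (1.0, 'B'), (1.7, 'A')]
```

In CHATS itself the overlap onset is driven by the phoneme sequence of the next utterance rather than a hand-written rule.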
Related papers
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, reduces speech sequences to lengths comparable to those of text sequences (a pooling sketch follows this summary).
We construct a multi-turn speech-to-speech dialogue dataset named method-500k, which includes nearly 500k turns of speech-to-speech dialogue.
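To make the sequence-length claim concrete, here is a minimal pooling sketch: adjacent speech-token embeddings are merged so the sequence entering the LLM is comparable in length to its transcription. The average pooling, group size, and dimensions are assumptions for illustration; GroupFormer's actual mechanism differs.

```python
# Minimal pooling sketch (illustrative): merge adjacent speech-token
# embeddings so the sequence entering the LLM shrinks toward text length.
import torch

def group_speech_tokens(embeddings: torch.Tensor, group_size: int) -> torch.Tensor:
    """(T, d) speech-token embeddings -> (ceil(T / group_size), d)."""
    T, d = embeddings.shape
    pad = (-T) % group_size                      # right-pad to a multiple
    if pad:
        embeddings = torch.cat([embeddings, embeddings.new_zeros(pad, d)])
    return embeddings.view(-1, group_size, d).mean(dim=1)

# e.g. 500 speech tokens collapse to 125 group embeddings, a length much
# closer to that of the corresponding transcription.
x = torch.randn(500, 256)
print(group_speech_tokens(x, 4).shape)  # torch.Size([125, 256])
```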
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components, such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
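A minimal sketch of what full-duplex operation means here, in contrast to a pipeline of voice activity detection, recognition, and synthesis: the model advances the user's stream and its own output stream on one shared timeline, so it can speak while still listening. The frame format and step function below are hypothetical stand-ins, not Moshi's token layout.

```python
def full_duplex_loop(user_frames, step):
    """step(user_frame, prev_agent_frame) -> next agent frame.
    Both streams advance together, one frame per tick."""
    agent, out = "<silence>", []
    for user in user_frames:
        agent = step(user, agent)   # emitted while still listening
        out.append(agent)
    return out

# Toy step: backchannel once the user goes quiet.
frames = ["hi", "how", "are", "you", "<silence>", "<silence>"]
print(full_duplex_loop(frames,
                       lambda u, a: "mhm" if u == "<silence>" else "<silence>"))
# ['<silence>', '<silence>', '<silence>', '<silence>', 'mhm', 'mhm']
```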
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
- Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation [16.724603503894166]
Style-Talker is an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialogue generation (a rough sketch of the cascade follows this summary).
Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence.
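The cascade can be sketched as follows; every callable is a stub standing in for a learned component, and all names are hypothetical rather than Style-Talker's API.

```python
def style_talker_step(audio_in, context, audio_llm, style_encoder, tts):
    style = style_encoder(audio_in)        # speaking style of the input turn
    reply = audio_llm(audio_in, context)   # reply text, no separate ASR pass
    return tts(reply, style)               # style-conditioned synthesis

# Stubs just to make the data flow concrete.
print(style_talker_step(
    audio_in=b"...wav bytes...",
    context=["user: hi there"],
    audio_llm=lambda audio, ctx: "Hi! How can I help?",
    style_encoder=lambda audio: [0.1, 0.9],
    tts=lambda text, style: f"<audio text={text!r} style={style}>",
))
```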
arXiv Detail & Related papers (2024-08-13T04:35:11Z)
- Adapting Text-based Dialogue State Tracker for Spoken Dialogues [20.139351605832665]
We describe our engineering effort in building a highly successful model that participated in the speech-aware dialogue systems technology challenge track in DSTC11.
Our model consists of three major modules: (1) automatic speech recognition error correction to bridge the gap between spoken and written utterances, (2) a text-based dialogue system (D3ST) that estimates slots and values from slot descriptions, and (3) post-processing to correct errors in the estimated slot values.
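A toy version of the three-stage pipeline, with trivial rules standing in for the learned models (the fixes, schema, and normalization table below are invented for illustration):

```python
def correct_asr(utterance: str) -> str:
    """Stage 1: bridge the spoken/written mismatch (toy lexical fixes)."""
    fixes = {"cheep": "cheap", "down town": "downtown"}
    for wrong, right in fixes.items():
        utterance = utterance.replace(wrong, right)
    return utterance

def estimate_slots(utterance: str, schema: dict) -> dict:
    """Stage 2: D3ST conditions a seq2seq model on slot descriptions;
    a plain keyword match stands in for that model here."""
    return {slot: value
            for slot, values in schema.items()
            for value in values
            if value in utterance}

def postprocess(slots: dict) -> dict:
    """Stage 3: map estimated values onto canonical ontology values."""
    canonical = {"downtown": "centre"}   # toy normalization table
    return {s: canonical.get(v, v) for s, v in slots.items()}

utt = correct_asr("i need a cheep hotel down town")
slots = estimate_slots(utt, {"hotel-pricerange": ["cheap", "expensive"],
                             "hotel-area": ["downtown", "centre"]})
print(postprocess(slots))
# {'hotel-pricerange': 'cheap', 'hotel-area': 'centre'}
```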
arXiv Detail & Related papers (2023-08-29T06:27:58Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances, going beyond the sequential contextualization provided by pre-trained language models (PrLMs).
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially outperforms strong PrLM baselines on four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- A Benchmark for Understanding and Generating Dialogue between Characters in Stories [75.29466820496913]
We present the first study to explore whether machines can understand and generate dialogue in stories.
We propose two new tasks, Masked Dialogue Generation and Dialogue Speaker Recognition, illustrated below.
We show the difficulty of the proposed tasks by testing existing models with automatic and manual evaluation on the DialStory dataset.
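The two task formats might look roughly like the toy records below; the field names and the story are invented for illustration, not the DialStory schema.

```python
# Masked Dialogue Generation: recover an utterance masked out of a story.
masked_dialogue_generation = {
    "story": ["Tom found Anna by the gate at dusk.",
              '"Where have you been?" asked Tom.',
              "[MASK]",   # the model must generate this utterance
              '"Then we leave at dawn," Tom said.'],
    "target": '"Down by the river, thinking," Anna replied.',
}

# Dialogue Speaker Recognition: attribute an utterance to a character.
dialogue_speaker_recognition = {
    "story": masked_dialogue_generation["story"],
    "utterance": '"Where have you been?"',
    "candidates": ["Tom", "Anna"],
    "label": "Tom",
}
print(dialogue_speaker_recognition["label"])
```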
arXiv Detail & Related papers (2022-09-18T10:19:04Z)
- Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History [38.65020349874135]
We propose an end-to-end empathetic dialogue speech synthesis (DSS) model.
Our model is conditioned on both the linguistic and the prosodic features of the dialogue history to predict prosody appropriate to the dialogue context.
To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained on large speech corpora, 2) style-guided training that uses a prosody embedding of the current utterance, predicted from the dialogue-context embedding, 3) cross-modal attention to combine the text and speech modalities, and 4) sentence-wise embedding to achieve finer-grained prosody modeling than utterance-wise modeling.
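Of the four ingredients, the cross-modal attention in 3) is the easiest to sketch: text-side states query speech-side states from the dialogue history. The dimensions and residual fusion below are illustrative assumptions, not the paper's exact design.

```python
# Illustrative cross-modal attention: text-side queries attend over
# speech-side keys/values from the dialogue history (assumed sizes).
import torch
from torch import nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, text_h: torch.Tensor, speech_h: torch.Tensor) -> torch.Tensor:
        # text_h: (B, T_text, d) queries; speech_h: (B, T_speech, d) keys/values
        fused, _ = self.attn(text_h, speech_h, speech_h)
        return text_h + fused  # residual combination of the two modalities

fusion = CrossModalFusion()
print(fusion(torch.randn(2, 12, 256), torch.randn(2, 50, 256)).shape)
# torch.Size([2, 12, 256])
```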
arXiv Detail & Related papers (2022-06-16T09:47:25Z)
- "How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations [87.95711406978157]
This work presents a new benchmark on spoken task-oriented conversations.
We study multi-domain dialogue state tracking and knowledge-grounded dialogue modeling.
Our data set enables speech-based benchmarking of task-oriented dialogue systems.
arXiv Detail & Related papers (2021-09-28T04:51:04Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model that fills this gap by modeling the utterance-aware and speaker-aware representations entailed in a dialogue history.
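One simple way to picture such representations, as an assumption rather than the paper's exact model, is to add speaker-role and utterance-index embeddings on top of the PrLM's token states:

```python
# Illustrative structure-aware augmentation: add speaker-role and
# utterance-index embeddings to the PrLM's contextual token states.
import torch
from torch import nn

class DialogueAwareEmbedding(nn.Module):
    def __init__(self, d_model: int = 256, n_speakers: int = 2, max_utts: int = 32):
        super().__init__()
        self.speaker = nn.Embedding(n_speakers, d_model)
        self.utterance = nn.Embedding(max_utts, d_model)

    def forward(self, token_h, speaker_ids, utt_ids):
        # token_h: (B, T, d) states from the PrLM encoder
        # speaker_ids, utt_ids: (B, T) role and utterance index per token
        return token_h + self.speaker(speaker_ids) + self.utterance(utt_ids)

emb = DialogueAwareEmbedding()
out = emb(torch.randn(1, 10, 256),
          torch.tensor([[0, 0, 0, 1, 1, 1, 0, 0, 0, 0]]),
          torch.tensor([[0, 0, 0, 1, 1, 1, 2, 2, 2, 2]]))
print(out.shape)  # torch.Size([1, 10, 256])
```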
arXiv Detail & Related papers (2020-09-14T15:07:19Z)