Modeling Real-Time Interactive Conversations as Timed Diarized Transcripts
- URL: http://arxiv.org/abs/2405.13203v1
- Date: Tue, 21 May 2024 21:14:31 GMT
- Title: Modeling Real-Time Interactive Conversations as Timed Diarized Transcripts
- Authors: Garrett Tanzer, Gustaf Ahdritz, Luke Melas-Kyriazi,
- Abstract summary: We present a simple yet general method to simulate real-time interactive conversations using pretrained language models.
We demonstrate the promise of this method with two case studies: instant messenger dialogues and spoken conversations.
- Score: 11.067252960486272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chatbots built upon language models have exploded in popularity, but they have largely been limited to synchronous, turn-by-turn dialogues. In this paper we present a simple yet general method to simulate real-time interactive conversations using pretrained text-only language models, by modeling timed diarized transcripts and decoding them with causal rejection sampling. We demonstrate the promise of this method with two case studies: instant messenger dialogues and spoken conversations, which require generation at about 30 tok/s and 20 tok/s respectively to maintain real-time interactivity. These capabilities can be added into language models using relatively little data and run on commodity hardware.
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming [0.0]
Mini-Omni is an audio-based end-to-end conversational model capable of real-time speech interaction.
We propose a text-instructed speech generation method, along with batch-parallel strategies during inference to boost the performance.
We also introduce the VoiceAssistant-400K dataset to fine-tune models for optimized speech output.
arXiv Detail & Related papers (2024-08-29T17:18:53Z) - Self-Directed Turing Test for Large Language Models [56.64615470513102]
The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations.
Traditional Turing tests adopt a rigid dialogue format where each participant sends only one message each time.
This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format.
arXiv Detail & Related papers (2024-08-19T09:57:28Z) - Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation [16.724603503894166]
Style-Talker is an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation.
Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence.
arXiv Detail & Related papers (2024-08-13T04:35:11Z) - Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x but still matching the quality of the autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z) - An Interleaving Semantics of the Timed Concurrent Language for Argumentation to Model Debates and Dialogue Games [0.0]
We propose a language for modelling concurrent interaction between agents.
Such a language exploits a shared memory used by the agents to communicate and reason on the acceptability of their beliefs.
We show how it can be used to model interactions such as debates and dialogue games taking place between intelligent agents.
arXiv Detail & Related papers (2023-06-13T10:41:28Z) - Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z) - CloneBot: Personalized Dialogue-Response Predictions [0.0]
The project task was to create a model that, given a speaker ID, chat history, and an utterance query, can predict the response utterance in a conversation.
The model is personalized for each speaker. This task can be a useful tool for building speech bots that talk in a human-like manner in a live conversation.
arXiv Detail & Related papers (2021-03-31T01:15:37Z) - Plug-and-Play Conversational Models [62.77150879036442]
We introduce an approach that requires neither further computation at decoding time nor any fine-tuning of a large language model.
We demonstrate, through extensive automatic and human evaluation, a high degree of control over the generated conversational responses with regard to multiple desired attributes.
arXiv Detail & Related papers (2020-10-09T03:17:51Z) - The Adapter-Bot: All-In-One Controllable Conversational Model [66.48164003532484]
We propose a dialogue model that uses a fixed backbone model such as DialGPT and triggers on-demand dialogue skills via different adapters.
Depending on the skills, the model is able to process multiple knowledge types, such as text, tables, and empathetic responses.
We evaluate our model using automatic evaluation by comparing it with existing state-of-the-art conversational models.
arXiv Detail & Related papers (2020-08-28T10:59:31Z)