Related papers: Modeling Real-Time Interactive Conversations as Timed Diarized Transcripts

URL: http://arxiv.org/abs/2405.13203v1
Date: Tue, 21 May 2024 21:14:31 GMT
Title: Modeling Real-Time Interactive Conversations as Timed Diarized Transcripts
Authors: Garrett Tanzer, Gustaf Ahdritz, Luke Melas-Kyriazi
Abstract summary: We present a simple yet general method to simulate real-time interactive conversations using pretrained language models. We demonstrate the promise of this method with two case studies: instant messenger dialogues and spoken conversations.
Score: 11.067252960486272
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chatbots built upon language models have exploded in popularity, but they have largely been limited to synchronous, turn-by-turn dialogues. In this paper we present a simple yet general method to simulate real-time interactive conversations using pretrained text-only language models, by modeling timed diarized transcripts and decoding them with causal rejection sampling. We demonstrate the promise of this method with two case studies: instant messenger dialogues and spoken conversations, which require generation at about 30 tok/s and 20 tok/s respectively to maintain real-time interactivity. These capabilities can be added into language models using relatively little data and run on commodity hardware.

Related papers

Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
We propose a novel preference alignment framework to improve spoken dialogue models on real-time conversations from user interactions. We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
arXiv Detail & Related papers (2025-06-26T16:45:20Z)
Vision-Speech Models: Teaching Speech Models to Converse about Images [67.62394024470528]
We introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.
arXiv Detail & Related papers (2025-03-19T18:40:45Z)
Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key conversational behaviors. We aim to advance spoken dialogue modeling and encourage the development of more interactive and natural dialogue systems.
arXiv Detail & Related papers (2025-03-06T18:59:16Z)
Real-Time Textless Dialogue Generation [23.456302461693053]
We propose a real-time, textless spoken dialogue generation model (RTTL-DG). Our system enables fluid turn-taking and generates responses with minimal delay by processing streaming spoken conversation directly. Our model incorporates backchannels, fillers, laughter, and other paralinguistic signals, which are often absent in cascaded dialogue systems.
arXiv Detail & Related papers (2025-01-08T23:21:43Z)
Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components, such as voice activity detection and text-to-speech. We show how Moshi can provide streaming speech recognition and text-to-speech. The resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming [0.0]
Mini-Omni is an audio-based end-to-end conversational model capable of real-time speech interaction. We propose a text-instructed speech generation method, along with batch-parallel strategies during inference to boost the performance. We also introduce the VoiceAssistant-400K dataset to fine-tune models for optimized speech output.
arXiv Detail & Related papers (2024-08-29T17:18:53Z)
Self-Directed Turing Test for Large Language Models [56.64615470513102]
The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations. Traditional Turing tests adopt a rigid dialogue format where each participant sends only one message each time. This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format.
arXiv Detail & Related papers (2024-08-19T09:57:28Z)
Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation [16.724603503894166]
Style-Talker is an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation. Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence.
arXiv Detail & Related papers (2024-08-13T04:35:11Z)
Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models. It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x but still matching the quality of the autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z)
An Interleaving Semantics of the Timed Concurrent Language for Argumentation to Model Debates and Dialogue Games [0.0]
We propose a language for modelling concurrent interaction between agents. Such a language exploits a shared memory used by the agents to communicate and reason on the acceptability of their beliefs. We show how it can be used to model interactions such as debates and dialogue games taking place between intelligent agents.
arXiv Detail & Related papers (2023-06-13T10:41:28Z)
Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain. We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)
CloneBot: Personalized Dialogue-Response Predictions [0.0]
The project task was to create a model that, given a speaker ID, chat history, and an utterance query, can predict the response utterance in a conversation. The model is personalized for each speaker. This task can be a useful tool for building speech bots that talk in a human-like manner in a live conversation.
arXiv Detail & Related papers (2021-03-31T01:15:37Z)
Plug-and-Play Conversational Models [62.77150879036442]
We introduce an approach that requires neither further computation at decoding time nor any fine-tuning of a large language model. We demonstrate, through extensive automatic and human evaluation, a high degree of control over the generated conversational responses with regard to multiple desired attributes.
arXiv Detail & Related papers (2020-10-09T03:17:51Z)
The Adapter-Bot: All-In-One Controllable Conversational Model [66.48164003532484]
We propose a dialogue model that uses a fixed backbone model such as DialGPT and triggers on-demand dialogue skills via different adapters. Depending on the skills, the model is able to process multiple knowledge types, such as text, tables, and empathetic responses. We evaluate our model using automatic evaluation by comparing it with existing state-of-the-art conversational models.
arXiv Detail & Related papers (2020-08-28T10:59:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.