When can I Speak? Predicting initiation points for spoken dialogue agents
- URL: http://arxiv.org/abs/2208.03812v1
- Date: Sun, 7 Aug 2022 20:58:52 GMT
- Title: When can I Speak? Predicting initiation points for spoken dialogue agents
- Authors: Siyan Li, Ashwin Paranjape, Christopher D. Manning
- Abstract summary: We predict the lead-time to initiation using prosodic features from a pre-trained speech representation model.
We train and evaluate the models on the Switchboard Corpus and find that our method vastly outperforms the common approach of waiting for 700ms of silence.
- Score: 41.64197357473437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current spoken dialogue systems initiate their turns after a long period of
silence (700-1000ms), which leads to little real-time feedback, sluggish
responses, and an overall stilted conversational flow. Humans typically respond
within 200ms and successfully predicting initiation points in advance would
allow spoken dialogue agents to do the same. In this work, we predict the
lead-time to initiation using prosodic features from a pre-trained speech
representation model (wav2vec 1.0) operating on user audio and word features
from a pre-trained language model (GPT-2) operating on incremental
transcriptions. To evaluate errors, we propose two metrics w.r.t. predicted and
true lead times. We train and evaluate the models on the Switchboard Corpus and
find that our method outperforms features from prior work on both metrics and
vastly outperforms the common approach of waiting for 700ms of silence.
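To make the setup concrete, here is a minimal sketch of how per-frame prosodic and word features could be fused to regress the lead-time to initiation. The GRU encoder, the feature dimensions, and the random tensors standing in for wav2vec 1.0 and GPT-2 outputs are illustrative assumptions, not the paper's actual architecture or training setup.

```python
import torch
import torch.nn as nn

class LeadTimePredictor(nn.Module):
    """Toy regressor over concatenated prosodic and word features.

    Dimensions are placeholders; the paper's wav2vec 1.0 and GPT-2
    feature extractors are not reproduced here.
    """

    def __init__(self, prosody_dim=512, word_dim=768, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(prosody_dim + word_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, prosody_feats, word_feats):
        # prosody_feats: (batch, frames, prosody_dim) from the user's audio
        # word_feats:    (batch, frames, word_dim) from incremental transcripts,
        #                aligned to the same frame rate
        fused = torch.cat([prosody_feats, word_feats], dim=-1)
        hidden, _ = self.encoder(fused)
        return self.head(hidden).squeeze(-1)  # per-frame lead-time estimate (s)

model = LeadTimePredictor()
prosody = torch.randn(2, 50, 512)   # stand-in for wav2vec 1.0 features
words = torch.randn(2, 50, 768)     # stand-in for GPT-2 features
print(model(prosody, words).shape)  # torch.Size([2, 50])
```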
Related papers
- Chain-of-Thought Training for Open E2E Spoken Dialogue Systems [57.77235760292348]
End-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information.
We propose a chain-of-thought (CoT) formulation to ensure that training on conversational data remains closely aligned with the multimodal language model.
Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets.
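For context on the reported metric, ROUGE-1 measures unigram overlap between a generated response and a reference. A minimal sketch of the F1 variant is shown below; the paper's evaluation presumably uses a standard ROUGE implementation rather than this toy function.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1: overlap of candidate and reference word counts."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("it was great thanks", "yeah it was really great"))  # ~0.667
```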
arXiv Detail & Related papers (2025-05-31T21:43:37Z)
- PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs [36.18860434920165]
Personality-aware conversation agents are underexplored due to the absence of personality annotations in speech datasets.
We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels.
We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations.
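As a rough illustration of such a preprocessing pipeline, the sketch below attaches simple conversation-level labels to ASR-style segments. The `Segment` fields and the labeling heuristics are stand-ins chosen for illustration; the paper derives response types and emotion/sentiment labels with dedicated models.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float   # seconds
    end: float
    speaker: str
    text: str

def annotate_dialog(segments: List[Segment]) -> List[dict]:
    """Attach toy conversation-level labels to ASR output."""
    annotated = []
    for prev, cur in zip(segments, segments[1:]):
        gap = cur.start - prev.end
        annotated.append({
            "speaker": cur.speaker,
            "text": cur.text,
            "gap_before_s": round(gap, 3),
            # Crude response-type guess: overlap vs. short backchannel vs. full turn.
            "response_type": "overlap" if gap < 0 else
                             "backchannel" if len(cur.text.split()) <= 2 else "turn",
        })
    return annotated

demo = [
    Segment(0.0, 1.8, "A", "how was the trip"),
    Segment(1.9, 2.1, "B", "oh"),
    Segment(2.2, 4.0, "B", "it was great thanks for asking"),
]
for row in annotate_dialog(demo):
    print(row)
```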
arXiv Detail & Related papers (2025-05-20T13:41:32Z)
- Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key conversational behaviors.
We aim to advance spoken dialogue modeling and encourage the development of more interactive and natural dialogue systems.
arXiv Detail & Related papers (2025-03-06T18:59:16Z)
- Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics [54.03209351287654]
We propose a novel evaluation protocol that can assess a spoken dialog system's turn-taking capabilities.
We present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events.
We will open source our evaluation platform to promote the development of advanced conversational AI systems.
arXiv Detail & Related papers (2025-03-03T04:46:04Z)
- Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection [24.71649541757314]
Short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue.
This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection model.
arXiv Detail & Related papers (2024-10-21T11:57:56Z)
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, reduces speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named IntrinsicVoice-500k, which includes nearly 500k turns of speech-to-speech dialogues.
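The sketch below illustrates only the general idea of group-wise shortening of a speech feature sequence (here by mean-pooling fixed-size groups of frames); it is not the GroupFormer architecture itself, and the group size and dimensions are arbitrary.

```python
import torch

def group_frames(frames: torch.Tensor, group_size: int = 4) -> torch.Tensor:
    """Shorten a speech feature sequence by pooling fixed-size groups of frames.

    frames: (batch, time, dim). Trailing frames that do not fill a full group
    are dropped for simplicity. Illustration of sequence shortening only.
    """
    batch, time, dim = frames.shape
    usable = (time // group_size) * group_size
    grouped = frames[:, :usable].reshape(batch, usable // group_size, group_size, dim)
    return grouped.mean(dim=2)  # (batch, time // group_size, dim)

speech = torch.randn(1, 400, 256)       # e.g. 400 audio frames
print(group_frames(speech, 8).shape)    # torch.Size([1, 50, 256])
```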
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
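A minimal sketch of the kind of cross-attention fusion described above is shown below; the single attention layer, the dimensions, and the sigmoid frame-level EOU head are illustrative assumptions rather than the paper's exact model.

```python
import torch
import torch.nn as nn

class EOUPredictor(nn.Module):
    """Sketch: fuse acoustic and linguistic features with cross-attention,
    then score each audio frame for an upcoming end of utterance (EOU)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, acoustic: torch.Tensor, linguistic: torch.Tensor) -> torch.Tensor:
        # acoustic:   (batch, audio_frames, dim)
        # linguistic: (batch, tokens, dim) from the partial ASR hypothesis
        fused, _ = self.cross_attn(query=acoustic, key=linguistic, value=linguistic)
        # Per-frame probability that the utterance ends within some horizon.
        return torch.sigmoid(self.head(fused)).squeeze(-1)

model = EOUPredictor()
probs = model(torch.randn(1, 120, 256), torch.randn(1, 15, 256))
print(probs.shape)  # torch.Size([1, 120])
```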
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
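De-duplication of discrete unit sequences is easy to illustrate: collapse consecutive repeats. A minimal sketch follows; subword modeling (e.g. BPE over the de-duplicated units) would compress the sequence further but is not shown.

```python
from itertools import groupby

def deduplicate(units):
    """Collapse runs of repeated discrete speech units (e.g. from a
    self-supervised quantizer) to shorten the sequence."""
    return [u for u, _ in groupby(units)]

units = [52, 52, 52, 7, 7, 131, 131, 131, 131, 52]
print(deduplicate(units))                          # [52, 7, 131, 52]
print(len(units), "->", len(deduplicate(units)))   # 10 -> 4
```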
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Leveraging Implicit Feedback from Deployment Data in Dialogue [83.02878726357523]
We study improving social conversational agents by learning from natural dialogue between users and a deployed model.
We leverage signals such as the length, sentiment, and reactions of users' subsequent utterances in the collected dialogue episodes.
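A toy sketch of turning such signals into a scalar score is shown below; the keyword lists, weights, and length normalization are invented stand-ins, not the authors' features or training objective.

```python
def implicit_feedback_score(next_user_turn: str) -> float:
    """Toy proxy reward computed from the user's next message."""
    words = next_user_turn.lower().split()
    length_signal = min(len(words) / 20.0, 1.0)   # longer replies ~ engagement
    positive = {"thanks", "great", "love", "cool", "haha"}
    negative = {"boring", "wrong", "stop", "bad"}
    sentiment_signal = sum(w in positive for w in words) - sum(w in negative for w in words)
    return length_signal + 0.5 * sentiment_signal

print(implicit_feedback_score("haha that was great, tell me more about it"))
print(implicit_feedback_score("stop"))
```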
arXiv Detail & Related papers (2023-07-26T11:34:53Z)
- Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech [40.65850332919397]
We propose more powerful pause insertion frameworks based on a pre-trained language model.
Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus.
We also leverage duration-aware pause insertion for more natural multi-speaker TTS.
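To illustrate the interface such a TTS front end might expose, the sketch below renders duration-aware pause tags given per-token predictions; in the paper those predictions come from a fine-tuned BERT, whereas here they are hand-supplied placeholders.

```python
def insert_pauses(tokens, pause_after, durations_ms):
    """Render text with duration-aware pause tags for a TTS front end.

    tokens:       list of words
    pause_after:  per-token flag, True if a pause should follow the token
    durations_ms: predicted pause length per token (ignored where no pause)
    """
    out = []
    for tok, flag, dur in zip(tokens, pause_after, durations_ms):
        out.append(tok)
        if flag:
            out.append(f"<pause {dur}ms>")
    return " ".join(out)

tokens = ["Well", "I", "think", "we", "should", "start"]
print(insert_pauses(tokens,
                    [True, False, True, False, False, False],
                    [300, 0, 150, 0, 0, 0]))
# Well <pause 300ms> I think <pause 150ms> we should start
```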
arXiv Detail & Related papers (2023-02-27T10:40:41Z)
- Turn-Taking Prediction for Natural Conversational Speech [40.189938418201656]
A common conversational utterance often involves multiple queries with turn-taking.
Disfluencies include pausing to think, hesitations, word lengthening, filled pauses and repeated phrases.
We present a turn-taking predictor built on top of the end-to-end (E2E) speech recognizer.
arXiv Detail & Related papers (2022-08-29T01:09:23Z)
- CloneBot: Personalized Dialogue-Response Predictions [0.0]
The project task was to create a model that, given a speaker ID, chat history, and an utterance query, can predict the response utterance in a conversation.
The model is personalized for each speaker. Such a model can be a useful tool for building speech bots that talk in a human-like manner in a live conversation.
arXiv Detail & Related papers (2021-03-31T01:15:37Z)
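To make the task's inputs and outputs concrete, here is a stub of the prediction interface; the `DialogueRequest` fields mirror the description above, and the echoing body is a placeholder rather than CloneBot's actual model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DialogueRequest:
    speaker_id: str          # whose persona the reply should imitate
    chat_history: List[str]  # prior turns, oldest first
    query: str               # the utterance to respond to

def predict_response(req: DialogueRequest) -> str:
    """Placeholder for a personalized response model; echoes inputs to
    show the expected interface only."""
    return f"[{req.speaker_id}] reply to '{req.query}' given {len(req.chat_history)} prior turns"

print(predict_response(DialogueRequest("spk_42", ["hi", "hey, how are you?"], "any plans tonight?")))
```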