When can I Speak? Predicting initiation points for spoken dialogue agents
- URL: http://arxiv.org/abs/2208.03812v1
- Date: Sun, 7 Aug 2022 20:58:52 GMT
- Title: When can I Speak? Predicting initiation points for spoken dialogue agents
- Authors: Siyan Li, Ashwin Paranjape, Christopher D. Manning
- Abstract summary: We predict the lead-time to initiation using prosodic features from a pre-trained speech representation model.
We train and evaluate the models on the Switchboard Corpus and find that our method vastly outperforms the common approach of waiting for 700ms of silence.
- Score: 41.64197357473437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current spoken dialogue systems initiate their turns after a long period of
silence (700-1000ms), which leads to little real-time feedback, sluggish
responses, and an overall stilted conversational flow. Humans typically respond
within 200ms and successfully predicting initiation points in advance would
allow spoken dialogue agents to do the same. In this work, we predict the
lead-time to initiation using prosodic features from a pre-trained speech
representation model (wav2vec 1.0) operating on user audio and word features
from a pre-trained language model (GPT-2) operating on incremental
transcriptions. To evaluate errors, we propose two metrics w.r.t. predicted and
true lead times. We train and evaluate the models on the Switchboard Corpus and
find that our method outperforms features from prior work on both metrics and
vastly outperforms the common approach of waiting for 700ms of silence.
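The abstract outlines a two-stream approach: prosodic features from wav2vec 1.0 over the user's audio and word features from GPT-2 over incremental transcriptions, combined to predict the lead-time to initiation. A minimal sketch of that kind of fusion regressor follows; the pooling, layer sizes, and MLP head are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch: fuse a prosodic feature vector (e.g. pooled wav2vec 1.0
# frames) with a lexical feature vector (e.g. the final GPT-2 hidden state of
# the incremental transcript) and regress the lead-time to initiation.
import torch
import torch.nn as nn

class LeadTimePredictor(nn.Module):
    def __init__(self, prosody_dim=512, word_dim=768, hidden_dim=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(prosody_dim + word_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # predicted lead-time in seconds
        )

    def forward(self, prosody_feats, word_feats):
        # prosody_feats: (batch, prosody_dim); word_feats: (batch, word_dim)
        fused = torch.cat([prosody_feats, word_feats], dim=-1)
        return self.head(fused).squeeze(-1)

# Placeholder features standing in for wav2vec 1.0 / GPT-2 outputs.
prosody = torch.randn(4, 512)
words = torch.randn(4, 768)
print(LeadTimePredictor()(prosody, words).shape)  # torch.Size([4])
```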
Related papers
- How Did We Get Here? Summarizing Conversation Dynamics [4.644319899528183]
We introduce the task of summarizing the dynamics of conversations by constructing a dataset of human-written summaries.
We evaluate whether such summaries can capture the trajectory of conversations via an established downstream task.
We show that they help both humans and automated systems with this forecasting task.
arXiv Detail & Related papers (2024-04-29T18:00:03Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
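As a small illustration of the de-duplication step mentioned above (the unit values are made up), collapsing runs of identical discrete units shortens the sequence before any subword modeling:

```python
# Collapse consecutive repeated discrete speech units (e.g. k-means cluster
# IDs of self-supervised features) into a single unit per run.
from itertools import groupby

def deduplicate(units):
    """[5, 5, 5, 12, 12, 7] -> [5, 12, 7]"""
    return [u for u, _ in groupby(units)]

print(deduplicate([5, 5, 5, 12, 12, 7, 7, 7, 3]))  # [5, 12, 7, 3]
```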
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts the listener's response: a sequence of facial gestures quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
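A minimal sketch of the quantization idea (the codebook size, feature dimension, and nearest-neighbour lookup are illustrative assumptions, not the paper's exact VQ-VAE):

```python
# Map a continuous facial-gesture frame to the nearest codebook entry,
# so listener motion can be modelled as a sequence of discrete codes.
import torch

codebook = torch.randn(512, 64)        # 512 learned codes of dimension 64
frame = torch.randn(1, 64)             # one frame of continuous gesture features
code_id = torch.cdist(frame, codebook).argmin(dim=-1)  # nearest code index
quantized = codebook[code_id]          # quantized gesture vector
print(code_id.item(), quantized.shape)
```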
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Leveraging Implicit Feedback from Deployment Data in Dialogue [83.02878726357523]
We study improving social conversational agents by learning from natural dialogue between users and a deployed model.
We leverage signals like user response length, sentiment and reaction of the future human utterances in the collected dialogue episodes.
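A toy sketch of turning such deployment signals into a scalar training signal; the specific weighting, length cap, and sentiment function are assumptions for illustration only:

```python
# Combine two implicit-feedback signals mentioned above -- how long the user's
# next utterance is and how positive it sounds -- into a reward for the bot
# message that preceded it.
def implicit_reward(next_user_utterance: str, sentiment_score: float) -> float:
    # sentiment_score is assumed to come from an off-the-shelf classifier, in [-1, 1]
    length_signal = min(len(next_user_utterance.split()) / 20.0, 1.0)
    return 0.5 * length_signal + 0.5 * sentiment_score

print(implicit_reward("that's really interesting, tell me more about your trip", 0.8))
```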
arXiv Detail & Related papers (2023-07-26T11:34:53Z)
- Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech [40.65850332919397]
We propose more powerful pause insertion frameworks based on a pre-trained language model.
Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus.
We also leverage duration-aware pause insertion for more natural multi-speaker TTS.
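A minimal sketch of that kind of pause-insertion head; the duration bins and the token-level classifier over BERT hidden states are illustrative assumptions, not the paper's exact model:

```python
# Token-level classifier that decides, at each word position, whether to
# insert a pause and which duration bin it falls into.
import torch
import torch.nn as nn

class PauseInserter(nn.Module):
    def __init__(self, bert_dim=768, num_duration_bins=4):
        super().__init__()
        # Class 0 = no pause; classes 1..num_duration_bins = pause of that duration bin.
        self.classifier = nn.Linear(bert_dim, num_duration_bins + 1)

    def forward(self, bert_hidden_states):
        # bert_hidden_states: (batch, seq_len, bert_dim) from a pre-trained BERT encoder
        return self.classifier(bert_hidden_states)

hidden = torch.randn(1, 10, 768)          # stand-in for BERT outputs on a 10-token sentence
print(PauseInserter()(hidden).argmax(-1))  # per-token pause / duration-bin decision
```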
arXiv Detail & Related papers (2023-02-27T10:40:41Z)
- Turn-Taking Prediction for Natural Conversational Speech [40.189938418201656]
A common conversational utterance often involves multiple queries with turn-taking.
Disfluencies include pausing to think, hesitations, word lengthening, filled pauses and repeated phrases.
We present a turn-taking predictor built on top of the end-to-end (E2E) speech recognizer.
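One way such a predictor can be sketched is as a lightweight classifier over the recognizer's encoder frames that decides whether the user has finished their query; the binary framing and dimensions below are assumptions, not the paper's exact design:

```python
# Per-frame end-of-turn classifier over ASR encoder outputs (illustrative).
import torch
import torch.nn as nn

class TurnTakingHead(nn.Module):
    def __init__(self, encoder_dim=512):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, 2)  # 0 = user continues, 1 = agent may take the turn

    def forward(self, encoder_frames):
        # encoder_frames: (batch, time, encoder_dim) from the E2E recognizer
        return self.proj(encoder_frames)

frames = torch.randn(1, 50, 512)               # stand-in for recognizer encoder output
print(TurnTakingHead()(frames).argmax(-1).shape)  # torch.Size([1, 50]) per-frame decisions
```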
arXiv Detail & Related papers (2022-08-29T01:09:23Z)
- Calibrate your listeners! Robust communication-based training for pragmatic speakers [30.731870275051957]
We propose a method that uses a population of neural listeners to regularize speaker training.
We show that language drift originates from the poor uncertainty calibration of a neural listener.
We evaluate both population-based objectives on reference games, and show that the ensemble method with better calibration enables the speaker to generate pragmatic utterances.
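A minimal sketch of the population idea (the listener scores here are random placeholders): average the probabilities several independently trained listeners assign to each candidate referent and use the better-calibrated ensemble distribution as the speaker's training signal.

```python
# Average referent probabilities over a population of listeners.
import torch

def ensemble_listener_prob(listener_logits):
    # listener_logits: (num_listeners, num_referents) scores for one utterance
    probs = torch.softmax(listener_logits, dim=-1)
    return probs.mean(dim=0)  # calibrated distribution over referents

logits = torch.randn(5, 3)             # 5 listeners, 3 candidate referents
print(ensemble_listener_prob(logits))
```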
arXiv Detail & Related papers (2021-10-11T17:07:38Z)
- "How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations [87.95711406978157]
This work presents a new benchmark on spoken task-oriented conversations.
We study multi-domain dialogue state tracking and knowledge-grounded dialogue modeling.
Our data set enables speech-based benchmarking of task-oriented dialogue systems.
arXiv Detail & Related papers (2021-09-28T04:51:04Z)
- CloneBot: Personalized Dialogue-Response Predictions [0.0]
The project task was to create a model that, given a speaker ID, chat history, and an utterance query, can predict the response utterance in a conversation.
The model is personalized for each speaker. Such a model can be a useful tool for building speech bots that converse in a human-like manner in live conversation.
arXiv Detail & Related papers (2021-03-31T01:15:37Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
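A minimal sketch in that spirit, scoring a candidate response by the cosine similarity between latent representations of the context and the response, with no gold reference needed; the encoder choice and similarity-based scorer are assumptions, not the paper's exact metric.

```python
# Unreferenced response scoring from latent representations (illustrative).
import torch
import torch.nn.functional as F

def score(context_emb, response_emb):
    # Higher cosine similarity -> more appropriate response for this context.
    return F.cosine_similarity(context_emb, response_emb, dim=-1)

context = torch.randn(1, 768)   # latent representation of the dialogue context
response = torch.randn(1, 768)  # latent representation of the candidate response
print(score(context, response).item())
```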
arXiv Detail & Related papers (2020-05-01T20:01:39Z)