Turn-Taking Prediction for Natural Conversational Speech
- URL: http://arxiv.org/abs/2208.13321v1
- Date: Mon, 29 Aug 2022 01:09:23 GMT
- Title: Turn-Taking Prediction for Natural Conversational Speech
- Authors: Shuo-yiin Chang, Bo Li, Tara N. Sainath, Chao Zhang, Trevor Strohman,
Qiao Liang, Yanzhang He
- Abstract summary: A common conversational utterance often involves multiple queries with turn-taking.
Disfluencies include pausing to think, hesitations, word lengthening, filled pauses and repeated phrases.
We present a turn-taking predictor built on top of the end-to-end (E2E) speech recognizer.
- Score: 40.189938418201656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While streaming voice assistant systems have been deployed in many
applications, they typically focus on unnatural, one-shot interactions,
assuming the input is a single voice query without hesitation or disfluency.
However, a common conversational utterance often involves multiple queries
with turn-taking, in addition to disfluencies. These disfluencies include
pausing to think, hesitations, word lengthening, filled pauses, and repeated
phrases. This makes speech recognition on conversational speech, including
utterances with multiple queries, a challenging task. To better model
conversational interaction, it is critical to discriminate between
disfluencies and the end of a query: the user should be able to hold the floor
during a disfluency, while the system should respond as quickly as possible
once the user has finished speaking. In this paper, we present a turn-taking
predictor built on top of the end-to-end (E2E) speech recognizer. Our best
system is obtained by jointly optimizing the ASR task and detecting whether
the user has paused to think or finished speaking. The proposed approach
demonstrates over 97% recall and 85% precision on predicting true turn-taking
with only 100 ms latency, on a test set designed with 4 types of disfluencies
inserted into conversational utterances.
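The paper itself ships no code, but the joint-optimization idea can be sketched as a shared streaming encoder feeding two heads: one for ASR and one for frame-level turn-taking classes. The PyTorch sketch below is a minimal illustration under assumed dimensions, class inventory, and loss weighting; it is not the authors' implementation.

```python
import torch.nn as nn

TURN_CLASSES = 3  # 0: speaking, 1: pause/disfluency (hold the floor), 2: end of query

class JointASRTurnTaking(nn.Module):
    """Shared streaming encoder with ASR and turn-taking heads (hypothetical)."""
    def __init__(self, feat_dim=80, hidden=512, vocab=1024):
        super().__init__()
        # A unidirectional LSTM keeps the encoder streamable for low latency.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.asr_head = nn.Linear(hidden, vocab)          # e.g., CTC logits
        self.turn_head = nn.Linear(hidden, TURN_CLASSES)  # per-frame turn-taking logits

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        enc, _ = self.encoder(feats)          # (B, T, hidden)
        return self.asr_head(enc), self.turn_head(enc)

def joint_loss(asr_loss, turn_logits, turn_targets, alpha=0.5):
    # Jointly optimize: ASR loss (CTC/RNN-T, computed elsewhere) plus a
    # weighted frame-level cross-entropy over the turn-taking classes.
    ce = nn.functional.cross_entropy(turn_logits.transpose(1, 2), turn_targets)
    return asr_loss + alpha * ce
```

Mapping hesitations and filled pauses to a hold-the-floor class, while firing end-of-query as soon as the user finishes, is what would produce the low-latency behavior the abstract reports.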
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300 ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
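A minimal sketch of how the cross-attention fusion of acoustic and linguistic information described above might look. The module shapes and the time-to-EOU regression target are assumptions for illustration, not the paper's actual design.

```python
import torch.nn as nn

class CrossAttentionEOU(nn.Module):
    """Text states attend over acoustic frames; a head predicts time-to-EOU."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_eou = nn.Linear(d_model, 1)   # remaining-time regression head

    def forward(self, text_states, acoustic_states):
        # text_states: (B, U, d) linguistic states; acoustic_states: (B, T, d)
        fused, _ = self.attn(query=text_states,
                             key=acoustic_states,
                             value=acoustic_states)
        # Estimate remaining time from the most recent text position; a value
        # near zero means the utterance is about to end.
        return self.to_eou(fused[:, -1])      # (B, 1)
```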
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embeddings for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
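The frozen-backbone-plus-separator idea above can be illustrated with a small trainable module that predicts per-talker masks over the frozen encoder's mixed embeddings. This is a generic sketch; the layer shapes are assumptions, not the actual Sidecar architecture.

```python
import torch
import torch.nn as nn

class Sidecar(nn.Module):
    """Trainable separator over a frozen encoder's mixed embeddings (hypothetical)."""
    def __init__(self, d_model=512, n_talkers=2):
        super().__init__()
        self.n_talkers = n_talkers
        self.mask_net = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model * n_talkers, kernel_size=1),
        )

    def forward(self, mixed):                  # mixed: (B, T, d)
        x = mixed.transpose(1, 2)              # (B, d, T) for Conv1d
        masks = torch.sigmoid(self.mask_net(x))
        B, _, T = masks.shape
        masks = masks.view(B, self.n_talkers, -1, T)
        # One masked copy of the mixed embedding per talker.
        return [(masks[:, k] * x).transpose(1, 2) for k in range(self.n_talkers)]

# Training would freeze the recognizer's encoder and update only the separator:
# for p in recognizer.encoder.parameters(): p.requires_grad = False
```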
- Multimodal Contextual Dialogue Breakdown Detection for Conversational AI Models [1.4199474167684119]
We introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model.
This model significantly outperforms other known best models by achieving an F1 of 69.27.
arXiv Detail & Related papers (2024-04-11T23:09:18Z)
- The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems [0.11470070927586018]
We evaluate 5 major commercial ASR systems for their conversational and multilingual support.
We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge.
Our findings help to evaluate the current state of conversational ASR, contribute towards multidimensional error analysis and evaluation, and identify the phenomena that need the most attention on the way to building robust interactive speech technologies.
arXiv Detail & Related papers (2023-07-28T11:38:05Z)
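For reference, the word error rates reported above follow the standard definition: word-level edit distance (substitutions, deletions, insertions) normalized by reference length. A plain-Python version of that metric, not code from the paper:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                     # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                     # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn taking is hard", "turn talking is hard"))  # 0.25 (1 sub / 4 words)
```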
- Question-Interlocutor Scope Realized Graph Modeling over Key Utterances for Dialogue Reading Comprehension [61.55950233402972]
We propose a new key-utterance extraction method for dialogue reading comprehension.
It performs prediction on units formed by several contiguous utterances, which can capture more answer-containing utterances.
We then propose Question-Interlocutor Scope Realized Graph (QuISG) modeling, a graph constructed over the text of the utterances.
arXiv Detail & Related papers (2022-10-26T04:00:42Z)
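A loose sketch of the contiguous-utterance idea above: score windows of adjacent utterances rather than single utterances, so answer spans that cross utterance boundaries are retained. The relevance-scoring interface, window size, and threshold here are hypothetical.

```python
from typing import List, Tuple

def extract_key_units(utterances: List[str],
                      scores: List[float],
                      window: int = 3,
                      threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) spans of contiguous utterances whose mean
    relevance passes the threshold; scores would come from a trained model."""
    units = []
    for start in range(len(utterances) - window + 1):
        if sum(scores[start:start + window]) / window >= threshold:
            units.append((start, start + window))
    return units
```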
- "How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations [87.95711406978157]
This work presents a new benchmark on spoken task-oriented conversations.
We study multi-domain dialogue state tracking and knowledge-grounded dialogue modeling.
Our dataset enables speech-based benchmarking of task-oriented dialogue systems.
arXiv Detail & Related papers (2021-09-28T04:51:04Z)
- Hierarchical Summarization for Longform Spoken Dialog [1.995792341399967]
Despite the pervasiveness of spoken dialog, automated speech understanding and quality information extraction remain markedly poor.
Compared to understanding text, auditory communication poses many additional challenges, such as speaker disfluencies, informal prose styles, and lack of structure.
We propose a two-stage ASR and text summarization pipeline with a set of semantic segmentation and merging algorithms to resolve these speech modeling challenges.
arXiv Detail & Related papers (2021-08-21T23:31:31Z)
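The two-stage pipeline above might look roughly like the sketch below: transcribe, segment the transcript where adjacent sentences diverge semantically, then summarize each segment. Here transcribe, embed, and summarize are placeholder callables, and similarity-drop segmentation is an assumed simplification of the paper's segmentation and merging algorithms.

```python
def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def segment(sentences, embed, threshold=0.6):
    # Start a new segment wherever adjacent sentences are semantically far apart.
    if not sentences:
        return []
    segments, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments

def summarize_dialog(audio, transcribe, embed, summarize):
    sentences = transcribe(audio)                  # stage 1: ASR transcript
    segs = segment(sentences, embed)               # semantic segmentation
    return [summarize(" ".join(s)) for s in segs]  # stage 2: per-segment summaries
```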
- Analysis and Tuning of a Voice Assistant System for Dysfluent Speech [7.233685721929227]
Speech recognition systems do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks.
We show that by tuning the decoding parameters in an existing hybrid speech recognition system, one can improve isWER by 24% (relative) for individuals with fluency disorders.
arXiv Detail & Related papers (2021-06-18T20:58:34Z)
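Decoding-parameter tuning of the kind described above is usually a grid search against a development set of dysfluent speech. The sketch below is illustrative only: decode and wer are placeholders, and the parameter names are generic examples rather than the paper's actual decoder knobs.

```python
from itertools import product

def tune_decoder(dev_set, decode, wer):
    """Pick the decoding parameters that minimize WER on (audio, ref) pairs."""
    grids = {
        "beam": [8, 16, 32],
        "lm_weight": [0.5, 1.0, 1.5],
        "silence_penalty": [0.0, 0.5, 1.0],  # long pauses are common in dysfluent speech
    }
    best = (float("inf"), None)
    for values in product(*grids.values()):
        params = dict(zip(grids, values))
        err = sum(wer(ref, decode(audio, **params))
                  for audio, ref in dev_set) / len(dev_set)
        best = min(best, (err, params), key=lambda t: t[0])
    return best  # (dev-set WER, best parameter setting)
```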
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on the LibrispeechMix dataset, a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)