The timing bottleneck: Why timing and overlap are mission-critical for
conversational user interfaces, speech recognition and dialogue systems
- URL: http://arxiv.org/abs/2307.15493v1
- Date: Fri, 28 Jul 2023 11:38:05 GMT
- Title: The timing bottleneck: Why timing and overlap are mission-critical for
conversational user interfaces, speech recognition and dialogue systems
- Authors: Andreas Liesenfeld, Alianda Lopez, Mark Dingemanse
- Abstract summary: We evaluate 5 major commercial ASR systems for their conversational and multilingual support.
We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge.
Our findings help to evaluate the current state of conversational ASR, contribute towards multidimensional error analysis and evaluation, and identify phenomena that need most attention on the way to build robust interactive speech technologies.
- Score: 0.11470070927586018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech recognition systems are a key intermediary in voice-driven
human-computer interaction. Although speech recognition works well for pristine
monologic audio, real-life use cases in open-ended interactive settings still
present many challenges. We argue that timing is mission-critical for dialogue
systems, and evaluate 5 major commercial ASR systems for their conversational
and multilingual support. We find that word error rates for natural
conversational data in 6 languages remain abysmal, and that overlap remains a
key challenge (study 1). This impacts especially the recognition of
conversational words (study 2), and in turn has dire consequences for
downstream intent recognition (study 3). Our findings help to evaluate the
current state of conversational ASR, contribute towards multidimensional error
analysis and evaluation, and identify phenomena that need most attention on the
way to build robust interactive speech technologies.
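For readers unfamiliar with the metric named in the abstract: word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the authors' evaluation pipeline. The function name `word_error_rate`, the whitespace tokenization, and the example utterances are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by whitespace splitting; real
    evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: the recognizer drops a short conversational word.
    print(word_error_rate("yeah i think so", "i think so"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.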
Related papers
- Are cascade dialogue state tracking models speaking out of turn in
spoken dialogues? [1.786898113631979]
This paper proposes a comprehensive analysis of the errors of state of the art systems in complex settings such as Dialogue State Tracking.
Based on spoken MultiWoz, we identify that errors on non-categorical slots' values are essential to address in order to bridge the gap between spoken and chat-based dialogue systems.
arXiv Detail & Related papers (2023-11-03T08:45:22Z)
- Adapting Text-based Dialogue State Tracker for Spoken Dialogues [20.139351605832665]
We describe our engineering effort in building a highly successful model that participated in the speech-aware dialogue systems technology challenge track in DSTC11.
Our model consists of three major modules: (1) automatic speech recognition error correction to bridge the gap between the spoken and the text utterances, (2) text-based dialogue system (D3ST) for estimating the slots and values using slot descriptions, and (3) post-processing for recovering the error of the estimated slot value.
arXiv Detail & Related papers (2023-08-29T06:27:58Z)
- PK-Chat: Pointer Network Guided Knowledge Driven Generative Dialogue
Model [79.64376762489164]
PK-Chat is a Pointer network guided generative dialogue model, incorporating a unified pretrained language model and a pointer network over knowledge graphs.
The words generated by PK-Chat in the dialogue are derived from the prediction of word lists and the direct prediction of the external knowledge graph knowledge.
Based on the PK-Chat, a dialogue system is built for academic scenarios in the case of geosciences.
arXiv Detail & Related papers (2023-04-02T18:23:13Z)
- Deep learning of segment-level feature representation for speech emotion
recognition in conversations [9.432208348863336]
We propose a conversational speech emotion recognition method to deal with capturing attentive contextual dependency and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models contextual-sensitive information and explores intra- and inter-speaker dependencies jointly.
arXiv Detail & Related papers (2023-02-05T16:15:46Z)
- Evaluation of Automated Speech Recognition Systems for Conversational
Speech: A Linguistic Perspective [0.0]
We take a linguistic perspective, and take the French language as a case study toward disambiguation of the French homophones.
Our contribution aims to provide more insight into human speech transcription accuracy in conditions to reproduce those of state-of-the-art ASR systems.
arXiv Detail & Related papers (2022-11-05T04:35:40Z)
- End-to-end Spoken Conversational Question Answering: Task, Dataset and
Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build the system to deal with conversational questions based on the audio recordings, and to explore the plausibility of providing more cues from different modalities with systems in information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z)
- "How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken
Conversations [87.95711406978157]
This work presents a new benchmark on spoken task-oriented conversations.
We study multi-domain dialogue state tracking and knowledge-grounded dialogue modeling.
Our data set enables speech-based benchmarking of task-oriented dialogue systems.
arXiv Detail & Related papers (2021-09-28T04:51:04Z)
- Topic-Aware Multi-turn Dialogue Modeling [91.52820664879432]
This paper presents a novel solution for multi-turn dialogue modeling, which segments and extracts topic-aware utterances in an unsupervised way.
Our topic-aware modeling is implemented by a newly proposed unsupervised topic-aware segmentation algorithm and Topic-Aware Dual-attention Matching (TADAM) Network.
arXiv Detail & Related papers (2020-09-26T08:43:06Z)
- TOD-BERT: Pre-trained Natural Language Understanding for Task-Oriented
Dialogue [113.45485470103762]
In this work, we unify nine human-human and multi-turn task-oriented dialogue datasets for language modeling.
To better model dialogue behavior during pre-training, we incorporate user and system tokens into the masked language modeling.
arXiv Detail & Related papers (2020-04-15T04:09:05Z)
- Recent Advances and Challenges in Task-oriented Dialog System [63.82055978899631]
Task-oriented dialog systems are attracting more and more attention in academic and industrial communities.
We discuss three critical topics for task-oriented dialog systems: (1) improving data efficiency to facilitate dialog modeling in low-resource settings, (2) modeling multi-turn dynamics for dialog policy learning, and (3) integrating domain knowledge into the dialog model.
arXiv Detail & Related papers (2020-03-17T01:34:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.