Related papers: Speculative End-Turn Detector for Efficient Speech Chatbot Assistant

Speculative End-Turn Detector for Efficient Speech Chatbot Assistant

URL: http://arxiv.org/abs/2503.23439v1
Date: Sun, 30 Mar 2025 13:34:23 GMT
Title: Speculative End-Turn Detector for Efficient Speech Chatbot Assistant
Authors: Hyunjong Ok, Suho Yoo, Jaeho Lee,
Abstract summary: We introduce the ETD dataset, the first public dataset for end-turn detection.<n>We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments.<n>Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low.
Score: 11.136112399898481
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD) -- the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken conversations. In this paper, we introduce the ETD Dataset, the first public dataset for end-turn detection. The ETD dataset consists of both synthetic speech data generated with text-to-speech models and real-world speech data collected from web sources. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Our approach jointly employs a lightweight GRU-based model, which rapidly detects the non-speaking units in real-time on local devices, and a high-performance Wav2vec-based model running on the server to make a more challenging classification of distinguishing turn ends from mere pauses. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low. Datasets and code will be available after the review.

Related papers

SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection [15.626772502710867]
We propose five novel data augmentation frameworks for synthetic user dialogue generation through a structured prompting approach.<n>Our proposed method yields 14 new dialogue datasets, which we benchmark against seven MGT detection models.<n>Considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy.
arXiv Detail & Related papers (2025-03-19T09:32:52Z)
Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.<n>It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.<n>It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance. We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information. Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
Text Injection for Capitalization and Turn-Taking Prediction in Speech Models [45.94388391693112]
This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data.
arXiv Detail & Related papers (2023-08-14T18:28:04Z)
Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation [22.38340990398735]
We propose a novel data augmentation method by applying the text-based speech editing model. The experimental results on code-switching and NER tasks show that our proposed method can significantly outperform the audio splicing and neural TTS based data augmentation systems.
arXiv Detail & Related papers (2023-06-14T15:50:13Z)
Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition [19.489794740679024]
We investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods. In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR. Experiments on LibiriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours.
arXiv Detail & Related papers (2023-01-06T22:32:50Z)
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned. We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate. We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique. Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER)
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions. Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks. This paper presents approaches aimed to achieve two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise, reverberation and b) reduce the system qualitydegradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.