BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection
- URL: http://arxiv.org/abs/2411.14100v1
- Date: Thu, 21 Nov 2024 13:05:18 GMT
- Title: BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection
- Authors: Anup Singh, Kris Demuynck, Vipul Arora
- Abstract summary: Spoken term detection is often hindered by its reliance on frame-level features and computationally intensive DTW-based template matching.
We propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens.
This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms.
- Abstract: Spoken term detection (STD) is often hindered by its reliance on frame-level features and computationally intensive DTW-based template matching, limiting its practicality. To address these challenges, we propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens. This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms. Our approach focuses on generating consistent token sequences across varying utterances of the same term. We also propose bidirectional state-space modeling within the Mamba encoder, trained in a self-supervised learning framework, to learn contextual frame-level features that are further encoded into discrete tokens. Our analysis shows that our speech tokens exhibit greater speaker invariance than those from existing tokenizers, making them more suitable for STD tasks. Empirical evaluation on the LibriSpeech and TIMIT databases indicates that our method outperforms existing STD baselines while being more efficient.
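As a rough illustration of the bidirectional state-space modeling described above, the sketch below runs one Mamba block over the frame sequence and a second over its time reversal, then fuses the two passes. It assumes the unidirectional `Mamba` block from the `mamba_ssm` package; the residual sum-and-normalize fusion is an assumption, not the paper's confirmed design.

```python
# Minimal sketch of bidirectional state-space modeling over frame-level features.
# Assumes the unidirectional Mamba block from the `mamba_ssm` package; the
# sum-and-normalize fusion below is an illustrative choice, not the paper's
# exact architecture.
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class BiMambaLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)  # scans frames left-to-right
        self.bwd = Mamba(d_model=d_model)  # scans the time-reversed frames
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) frame-level speech features
        out_f = self.fwd(x)
        out_b = torch.flip(self.bwd(torch.flip(x, dims=[1])), dims=[1])
        # Residual fusion of both directions (assumption), then normalize.
        return self.norm(x + out_f + out_b)
```

Stacking such layers yields contextual frame-level features that can then be discretized into the speaker-agnostic tokens used for text-style indexing and search.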
Related papers
- STAB: Speech Tokenizer Assessment Benchmark [57.45234921100835]
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text.
We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively.
We evaluate the STAB metrics and correlate them with downstream task performance across a range of speech tasks and tokenizer choices.
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
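A minimal sketch of the "inserting vector quantization into the encoder" step, assuming a plain nearest-neighbor codebook with a straight-through gradient; this is a generic VQ layer, not CosyVoice's released code.

```python
# Generic vector-quantization layer of the kind that can be inserted into an
# ASR encoder to emit discrete semantic tokens (illustrative; the codebook size
# and distance metric are assumptions, not CosyVoice's released design).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int, d_model: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, d_model)

    def forward(self, h: torch.Tensor):
        # h: (batch, time, d_model) hidden states from an encoder layer
        # Squared L2 distance from every frame to every codebook entry.
        dist = (h.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        ids = dist.argmin(dim=-1)       # (batch, time) discrete token ids
        q = self.codebook(ids)          # quantized continuous vectors
        q = h + (q - h).detach()        # straight-through gradient back to h
        return q, ids                   # q feeds the remaining encoder layers
```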
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment [19.48653924804823]
Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers.
However, LLM-based TTS models are not robust: the generated output can contain repeated words, missing words, and misaligned speech.
We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text.
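The finding above can be probed with a small helper that reads an alignment off a single cross-attention head and scores its monotonicity; this probe is illustrative, not the paper's code.

```python
# Illustrative probe (not the paper's code): read an alignment off one
# cross-attention head and measure how monotonic it is over speech steps.
import torch

def alignment_from_head(attn: torch.Tensor) -> torch.Tensor:
    # attn: (speech_steps, text_tokens) attention weights of a single head
    return attn.argmax(dim=-1)  # most-attended text token per speech step

def monotonicity(path: torch.Tensor) -> float:
    # Fraction of consecutive steps where the attended text index never moves back.
    deltas = path[1:] - path[:-1]
    return (deltas >= 0).float().mean().item()
```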
arXiv Detail & Related papers (2024-06-25T22:18:52Z)
- DASB -- Discrete Audio and Speech Benchmark [12.02056212008393]
We release the Discrete Audio and Speech Benchmark (DASB), a leaderboard for benchmarking discrete audio tokens across a range of tasks.
Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks.
However, the performance gap between semantic tokens and standard continuous representations remains substantial.
arXiv Detail & Related papers (2024-06-20T13:23:27Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit learning-based vision-language pre-training approaches such as CLIP, representing cross-modal inputs with finite discrete tokens rather than patch and token embeddings.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
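A hedged sketch of pseudo-language induction, assuming k-means units over self-supervised frame features; Wav2Seq additionally compresses the unit ids with subword modeling, which is omitted here.

```python
# Hedged sketch of pseudo-language induction: cluster self-supervised frame
# features into discrete units and collapse repeats into a token sequence.
# (Wav2Seq further compresses these ids with subword modeling such as BPE;
# the k-means setup and feature source here are illustrative assumptions.)
from itertools import groupby

import numpy as np
from sklearn.cluster import KMeans

def fit_unit_clusters(corpus_features: np.ndarray, n_units: int = 100) -> KMeans:
    # corpus_features: (num_frames, feat_dim) features pooled over many utterances
    return KMeans(n_clusters=n_units, n_init=10).fit(corpus_features)

def pseudo_tokens(km: KMeans, utt_features: np.ndarray) -> list[int]:
    frame_ids = km.predict(utt_features)
    # Collapse consecutive duplicates so each token marks a unit change.
    return [int(k) for k, _ in groupby(frame_ids)]
```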
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
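One common way to make overlapped activity a single-label target is power-set encoding, mapping each subset of active speakers to one class; the summary above does not confirm that SEND uses exactly this scheme, so the sketch is an illustrative assumption.

```python
# Illustrative assumption (not confirmed by the summary above): power-set
# encoding maps every subset of active speakers -- including silence -- to one
# class id, turning a multi-label problem into single-label prediction.
from itertools import combinations

def powerset_label_table(max_speakers: int) -> dict:
    table, next_id = {}, 0
    for k in range(max_speakers + 1):
        for subset in combinations(range(max_speakers), k):
            table[frozenset(subset)] = next_id
            next_id += 1
    return table

labels = powerset_label_table(3)
print(labels[frozenset()])        # silence class
print(labels[frozenset({0, 2})])  # speakers 0 and 2 overlapping
```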
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- AlloST: Low-resource Speech Translation without Source Transcription [17.53382405899421]
We propose a learning framework that utilizes a language-independent universal phone recognizer.
The framework is based on an attention-based sequence-to-sequence model.
Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the conformer-based baseline.
arXiv Detail & Related papers (2021-05-01T05:30:18Z)
- An Effective Contextual Language Modeling Framework for Speech Summarization with Augmented Features [13.97006782398121]
The Bidirectional Encoder Representations from Transformers (BERT) model has achieved record-breaking success on many natural language processing tasks.
We explore incorporating confidence scores into sentence representations to examine whether this can alleviate the negative effects of imperfect automatic speech recognition.
We validate the effectiveness of our proposed method on a benchmark dataset.
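A minimal sketch of one way to fold confidence scores into sentence representations, via confidence-weighted pooling of contextual token embeddings; the paper's exact mechanism is not specified in the summary, so this is an assumption.

```python
# Illustrative sketch (the paper's exact incorporation is not specified in the
# summary above): fold per-token ASR confidence scores into a sentence
# representation by confidence-weighted pooling of contextual embeddings.
import torch

def confidence_weighted_sentence(token_embs: torch.Tensor,
                                 confidences: torch.Tensor) -> torch.Tensor:
    # token_embs: (num_tokens, hidden) BERT outputs for one ASR hypothesis
    # confidences: (num_tokens,) per-token ASR confidence scores in [0, 1]
    weights = confidences / confidences.sum().clamp_min(1e-8)
    return (weights.unsqueeze(-1) * token_embs).sum(dim=0)
```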
arXiv Detail & Related papers (2020-06-01T18:27:48Z)