Related papers: Unsupervised Speech Recognition with N-Skipgram and Positional Unigram Matching

URL: http://arxiv.org/abs/2310.02382v1
Date: Tue, 3 Oct 2023 19:05:32 GMT
Title: Unsupervised Speech Recognition with N-Skipgram and Positional Unigram Matching
Authors: Liming Wang, Mark Hasegawa-Johnson and Chang D. Yoo
Abstract summary: We introduce a novel ASR system, ESPUM. This system harnesses the power of lower-order N-skipgrams (up to N=3) combined with positional unigram statistics gathered from a small batch of samples. Our model showcases competitive performance in ASR and phoneme segmentation tasks.
Score: 67.98016412551245
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training unsupervised speech recognition systems presents challenges due to GAN-associated instability, misalignment between speech and text, and significant memory demands. To tackle these challenges, we introduce a novel ASR system, ESPUM. This system harnesses the power of lower-order N-skipgrams (up to N=3) combined with positional unigram statistics gathered from a small batch of samples. Evaluated on the TIMIT benchmark, our model showcases competitive performance in ASR and phoneme segmentation tasks. Access our publicly available code at https://github.com/lwang114/GraphUnsupASR.

Related papers

BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection [8.303512060791736]
Spoken term detection is often hindered by reliance on frame-level features and the computationally intensive DTW-based template matching. We propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens. This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms.
arXiv Detail & Related papers (2024-11-21T13:05:18Z)
Unifying Global and Near-Context Biasing in a Single Trie Pass [11.277273712268897]
We propose an unexplored combination of an NE bias list and a word-level n-gram language model (LM). We show that the proposed combination of keyword biasing and n-gram LM improves entity recognition by up to 32% relative and reduces overall WER by up to 12% relative.
arXiv Detail & Related papers (2024-09-20T13:53:37Z)
Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems. We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems. We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
Streaming Speech-to-Confusion Network Speech Recognition [19.720334657478475]
We present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency. We show that 1-best results of our model are on par with a comparable RNN-T system. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
arXiv Detail & Related papers (2023-06-02T20:28:14Z)
A Few Shot Multi-Representation Approach for N-gram Spotting in Historical Manuscripts [1.2930503923129213]
We propose a few-shot learning paradigm for spotting sequences of a few characters (N-gram). We exhibit that recognition of important n-grams could reduce the system's dependency on vocabulary.
arXiv Detail & Related papers (2022-09-21T15:35:02Z)
Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem. We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR [77.82653227783447]
We propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network. As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.
arXiv Detail & Related papers (2022-03-01T05:02:02Z)
Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition [62.997667081978825]
We present an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly. In this paper we demonstrate that through this mechanism our system is able to recognize more than 85% of newly added words that it previously failed to recognize.
arXiv Detail & Related papers (2021-07-05T21:08:34Z)
DNN-Based Semantic Model for Rescoring N-best Speech Recognition List [8.934497552812012]
The word error rate (WER) of an automatic speech recognition (ASR) system increases when the training and testing conditions are mismatched, for example because of noise. This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features.
arXiv Detail & Related papers (2020-11-02T13:50:59Z)
Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR). APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker. We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.