ExKaldi-RT: A Real-Time Automatic Speech Recognition Extension Toolkit
of Kaldi
- URL: http://arxiv.org/abs/2104.01384v1
- Date: Sat, 3 Apr 2021 12:16:19 GMT
- Title: ExKaldi-RT: A Real-Time Automatic Speech Recognition Extension Toolkit
of Kaldi
- Authors: Yu Wang, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu
Nishizaki
- Abstract summary: This paper describes "ExKaldi-RT," an online ASR toolkit built on Kaldi and implemented in Python.
ExKaldi-RT provides tools for constructing a real-time audio stream pipeline, extracting acoustic features, transmitting packets over a remote connection, estimating acoustic probabilities with a neural network, and decoding online.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The availability of open-source software is playing a remarkable role in
automatic speech recognition (ASR). Kaldi, for instance, is widely used to
develop state-of-the-art offline and online ASR systems. This paper describes
"ExKaldi-RT," an online ASR toolkit built on Kaldi and implemented in Python.
ExKaldi-RT provides tools for constructing a real-time audio stream pipeline,
extracting acoustic features, transmitting packets over a remote connection,
estimating acoustic probabilities with a neural network, and decoding online.
While similar functionality is already available in Kaldi itself, a key feature
of ExKaldi-RT is that it works entirely in Python, whose easy-to-use interface
lets online ASR developers pursue original research, for example by applying
neural network-based signal processing or acoustic models trained with deep
learning frameworks. We performed benchmark experiments on the minimum
LibriSpeech corpus and showed that ExKaldi-RT achieves competitive ASR
performance in real time.
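The streaming pipeline described in the abstract can be sketched in plain Python as a chain of stages connected by thread-safe queues. The stage functions, frame parameters, and names below are illustrative assumptions for a generic online pipeline, not ExKaldi-RT's actual API.

```python
import queue
import threading

# Hypothetical framing parameters: 25 ms windows with a 10 ms shift at 16 kHz.
FRAME_SIZE = 400
FRAME_SHIFT = 160

def frame_stream(samples, size=FRAME_SIZE, shift=FRAME_SHIFT):
    """Slice a raw sample stream into overlapping analysis frames."""
    frames = []
    start = 0
    while start + size <= len(samples):
        frames.append(samples[start:start + size])
        start += shift
    return frames

def run_stage(fn, inbox, outbox):
    """Consume items from inbox, apply fn, forward results; None marks end-of-stream."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            return
        outbox.put(fn(item))

def run_pipeline(samples, feature_fn, decode_fn):
    """Wire feature extraction and decoding stages together with queues,
    mirroring the audio -> features -> probabilities/decoding chain."""
    q_frames, q_feats, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=run_stage, args=(feature_fn, q_frames, q_feats)),
        threading.Thread(target=run_stage, args=(decode_fn, q_feats, q_out)),
    ]
    for t in threads:
        t.start()
    for frame in frame_stream(samples):
        q_frames.put(frame)
    q_frames.put(None)  # signal end of stream
    results = []
    while (item := q_out.get()) is not None:
        results.append(item)
    for t in threads:
        t.join()
    return results
```

In a real system the feature stage would compute, e.g., filterbank features, and the decode stage would query a neural acoustic model; here any per-frame functions can be plugged in to exercise the queue-based flow.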
Related papers
- Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models [21.85677682584916]
The paper introduces speculative speech recognition (SSR).
The authors propose a model that performs SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM).
arXiv Detail & Related papers (2024-07-05T16:52:55Z)
- Automatic Speech Recognition for Hindi [0.6292138336765964]
The research involved developing a web application and designing a web interface for speech recognition.
The web application manages large volumes of audio files and their transcriptions, facilitating human correction of ASR transcripts.
The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine.
arXiv Detail & Related papers (2024-06-26T07:39:20Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what"
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
- Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z)
- Long-Running Speech Recognizer: An End-to-End Multi-Task Learning Framework for Online ASR and VAD [10.168591454648123]
This paper presents a novel end-to-end (E2E), multi-task learning (MTL) framework that integrates ASR and VAD into one model.
The proposed system, which we refer to as Long-Running Speech Recognizer (LR-SR), learns ASR and VAD jointly from two separate task-specific datasets in the training stage.
In the inference stage, the LR-SR system removes non-speech parts at low computational cost and recognizes speech parts with high robustness.
arXiv Detail & Related papers (2021-03-02T11:49:03Z)
- Improving Low Resource Code-switched ASR using Augmented Code-switched TTS [29.30430160611224]
Building Automatic Speech Recognition systems for code-switched speech has recently gained renewed attention.
End-to-end systems require large amounts of labeled speech.
We report significant improvements in ASR performance achieving absolute word error rate (WER) reductions of up to 5%.
arXiv Detail & Related papers (2020-10-12T09:15:12Z)
- PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR [65.20342293605472]
PyChain is an implementation of end-to-end lattice-free maximum mutual information (LF-MMI) training for the so-called "chain" models in the Kaldi automatic speech recognition (ASR) toolkit.
Unlike other PyTorch and Kaldi based ASR toolkits, PyChain is designed to be as flexible and light-weight as possible.
arXiv Detail & Related papers (2020-05-20T02:10:21Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the "clean" and "other" test data of LibriSpeech, respectively.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.