RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech
Recognition
- URL: http://arxiv.org/abs/2305.17782v1
- Date: Sun, 28 May 2023 17:48:48 GMT
- Title: RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech
Recognition
- Authors: Wei Zhou, Eugen Beck, Simon Berger, Ralf Schlüter, Hermann Ney
- Abstract summary: We present RASR2, a research-oriented generic S2S decoder implemented in C++.
It offers strong flexibility and compatibility across various S2S models, language models, label units/topologies, and neural network architectures.
It provides efficient decoding for both open- and closed-vocabulary scenarios based on a generalized search framework with rich support for different search modes and settings.
- Score: 43.081758770899235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern public ASR tools usually provide rich support for training various
sequence-to-sequence (S2S) models, but rather simple decoding support covering
open-vocabulary scenarios only. For closed-vocabulary scenarios, public tools
supporting lexically constrained decoding are usually limited to classical ASR, or
do not support all S2S models. To eliminate this restriction on research
possibilities such as modeling unit choice, we present RASR2 in this work, a
research-oriented generic S2S decoder implemented in C++. It offers strong
flexibility and compatibility across various S2S models, language models, label
units/topologies, and neural network architectures. It provides efficient
decoding for both open- and closed-vocabulary scenarios based on a generalized
search framework with rich support for different search modes and settings. We
evaluate RASR2 with a wide range of experiments on both the Switchboard and
LibriSpeech corpora. Our source code is publicly available online.
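To make the open- vs. closed-vocabulary distinction in the abstract concrete, below is a minimal, self-contained Python sketch of a label-synchronous beam search in which closed-vocabulary decoding restricts label successors through a lexicon prefix tree. This is purely illustrative and is not RASR2's C++ implementation or API; the scoring function, label set, lexicon, and all names are hypothetical.

```python
# Minimal sketch (not RASR2's actual C++ API): a label-synchronous beam search
# that runs either open-vocabulary (any subword label may follow) or
# closed-vocabulary (successors restricted by a lexicon prefix tree).
# The scoring function below is a placeholder for acoustic + LM scores.
from dataclasses import dataclass


@dataclass
class Hyp:
    labels: tuple = ()      # subword labels emitted so far
    score: float = 0.0      # accumulated log score
    lex_state: tuple = ()   # position inside the current word (closed-vocab only)


def build_prefix_tree(lexicon, eow="</w>"):
    """Map each in-vocabulary word prefix to the set of labels that may extend it."""
    succ = {}
    for word in lexicon:                                  # word = tuple of labels
        for i in range(len(word)):
            succ.setdefault(tuple(word[:i]), set()).add(word[i])
        succ.setdefault(tuple(word), set()).add(eow)      # a complete word may close
    return succ


def beam_search(score_fn, labels, steps, beam_size, prefix_tree=None, eow="</w>"):
    beams = [Hyp()]
    for _ in range(steps):
        candidates = []
        for hyp in beams:
            # Open vocabulary: every label is allowed.
            # Closed vocabulary: only labels reachable in the prefix tree.
            allowed = labels if prefix_tree is None else prefix_tree.get(hyp.lex_state, set())
            for lab in allowed:
                new_state = () if lab == eow else hyp.lex_state + (lab,)
                candidates.append(Hyp(hyp.labels + (lab,),
                                      hyp.score + score_fn(hyp.labels, lab),
                                      new_state))
        beams = sorted(candidates, key=lambda h: h.score, reverse=True)[:beam_size]
    return beams


if __name__ == "__main__":
    labels = {"he", "llo", "wor", "ld", "</w>"}
    lexicon = [("he", "llo"), ("wor", "ld")]              # toy closed vocabulary
    toy_score = lambda prefix, lab: -0.1 * len(lab)       # stand-in for a real model
    best = beam_search(toy_score, labels, steps=3, beam_size=4,
                       prefix_tree=build_prefix_tree(lexicon))
    print(best[0].labels)   # e.g. ('he', 'llo', '</w>')
```

The sketch only captures the constrained-successor idea; RASR2 itself realizes this within a generalized C++ search framework with configurable label units/topologies, language model integration, and different search modes and settings.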
Related papers
- SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition [77.28814034644287]
We propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed.
SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context.
We evaluate SVTRv2 on both standard and recent challenging benchmarks.
arXiv Detail & Related papers (2024-11-24T14:21:35Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses the limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- REST: Retrieval-Based Speculative Decoding [69.06115086237207]
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation.
Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens.
When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation.
arXiv Detail & Related papers (2023-11-14T15:43:47Z)
- Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be directly used to yield the desired answer.
arXiv Detail & Related papers (2023-06-15T20:56:20Z)
- A Lexical-aware Non-autoregressive Transformer-based ASR Model [9.500518278458905]
We propose a lexical-aware non-autoregressive Transformer-based (LA-NAT) ASR framework, which consists of an acoustic encoder, a speech-text shared encoder, and a speech-text shared decoder.
LA-NAT aims to make the ASR model aware of lexical information, so the resulting model is expected to achieve better results by leveraging the learned linguistic knowledge.
arXiv Detail & Related papers (2023-05-18T09:50:47Z)
- Unleashing the True Potential of Sequence-to-Sequence Models for Sequence Tagging and Structure Parsing [18.441585314765632]
Sequence-to-Sequence (S2S) models have achieved remarkable success on various text generation tasks.
We present a systematic study of S2S modeling using constrained decoding on four core tasks.
arXiv Detail & Related papers (2023-02-05T01:37:26Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- A Hierarchical Model for Spoken Language Recognition [29.948719321162883]
Spoken language recognition (SLR) refers to the automatic process used to determine the language present in a speech sample.
We propose a novel hierarchical approach where two PLDA models are trained: one to generate scores for clusters of highly related languages and a second to generate scores conditioned on each cluster.
We show that this hierarchical approach consistently outperforms the non-hierarchical one for detection of highly related languages.
arXiv Detail & Related papers (2022-01-04T22:10:36Z)