SpeechBrain: A General-Purpose Speech Toolkit
- URL: http://arxiv.org/abs/2106.04624v1
- Date: Tue, 8 Jun 2021 18:22:56 GMT
- Title: SpeechBrain: A General-Purpose Speech Toolkit
- Authors: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe,
Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab
Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng
Liao, Elena Rastorgueva, Fran\c{c}ois Grondin, William Aris, Hwidong Na, Yan
Gao, Renato De Mori, Yoshua Bengio
- Abstract summary: SpeechBrain is an open-source and all-in-one speech toolkit.
It is designed to facilitate the research and development of neural speech processing technologies.
It achieves competitive or state-of-the-art performance in a wide range of speech benchmarks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed
to facilitate the research and development of neural speech processing
technologies by being simple, flexible, user-friendly, and well-documented.
This paper describes the core architecture designed to support several tasks of
common interest, allowing users to naturally conceive, compare and share novel
speech processing pipelines. SpeechBrain achieves competitive or
state-of-the-art performance in a wide range of speech benchmarks. It also
provides training recipes, pretrained models, and inference scripts for popular
speech datasets, as well as tutorials which allow anyone with basic Python
proficiency to familiarize themselves with speech technologies.
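The "training recipes" the abstract mentions follow a common pattern: a Brain-style class defines a forward pass and an objective as separate hooks, and a generic fit loop drives training. A minimal standalone sketch of that pattern in plain Python (the hook names mirror SpeechBrain's documented `compute_forward` / `compute_objectives` / `fit` API, but this toy class and its data are illustrative, not the library itself):

```python
# Standalone sketch of the "Brain" recipe pattern; method names mirror
# SpeechBrain's documented hooks, but the class itself is a hypothetical
# stand-in, not the library's implementation.

class ToyBrain:
    """Minimal stand-in for a speechbrain.core.Brain-style recipe class."""

    def compute_forward(self, batch):
        # Model forward pass; here the "model" simply doubles each input.
        return [2 * x for x in batch["inputs"]]

    def compute_objectives(self, predictions, batch):
        # Loss between predictions and targets (mean absolute error).
        errors = [abs(p - t) for p, t in zip(predictions, batch["targets"])]
        return sum(errors) / len(errors)

    def fit(self, train_set, epochs=1):
        # Generic training loop: forward pass, then loss
        # (optimizer step omitted for brevity).
        losses = []
        for _ in range(epochs):
            for batch in train_set:
                predictions = self.compute_forward(batch)
                losses.append(self.compute_objectives(predictions, batch))
        return losses

data = [{"inputs": [1, 2], "targets": [2, 4]}]
print(ToyBrain().fit(data))  # -> [0.0]
```

Keeping the forward pass and the objective in separate hooks lets a recipe swap models or losses without touching the training loop, which is the kind of flexibility the toolkit advertises.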
Related papers
- Open-Source Conversational AI with SpeechBrain 1.0
SpeechBrain is an open-source Conversational AI toolkit based on PyTorch.
It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them.
arXiv Detail & Related papers (2024-06-29T15:20:11Z)
- SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
Speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.
We propose a novel decoder-only speech language model, SpeechComposer, that can unify common speech tasks by composing a fixed set of prompt tokens.
arXiv Detail & Related papers (2024-01-31T18:06:29Z)
- SALMONN: Towards Generic Hearing Abilities for Large Language Models
We propose SALMONN, a speech audio language music open neural network.
It is built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model.
It is the first model of its type and can be regarded as a step towards AI with generic hearing abilities.
arXiv Detail & Related papers (2023-10-20T05:41:57Z)
- On decoder-only architecture for speech-to-text and large language model integration
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing
Paralinguistic speech processing is important in addressing many issues, such as sentiment and neurocognitive disorder analyses.
We consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing.
SpeechFormer++ is evaluated on the speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer's disease detection (Pitt) tasks.
arXiv Detail & Related papers (2023-02-27T11:48:54Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Building African Voices
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- ESPnet-ST: All-in-One Speech Translation Toolkit
ESPnet-ST is a new project within the end-to-end speech processing toolkit ESPnet.
It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
arXiv Detail & Related papers (2020-04-21T18:38:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.