SpeechBrain: A General-Purpose Speech Toolkit
- URL: http://arxiv.org/abs/2106.04624v1
- Date: Tue, 8 Jun 2021 18:22:56 GMT
- Title: SpeechBrain: A General-Purpose Speech Toolkit
- Authors: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe,
Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab
Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng
Liao, Elena Rastorgueva, Fran\c{c}ois Grondin, William Aris, Hwidong Na, Yan
Gao, Renato De Mori, Yoshua Bengio
- Abstract summary: SpeechBrain is an open-source and all-in-one speech toolkit.
It is designed to facilitate the research and development of neural speech processing technologies.
It achieves competitive or state-of-the-art performance in a wide range of speech benchmarks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed
to facilitate the research and development of neural speech processing
technologies by being simple, flexible, user-friendly, and well-documented.
This paper describes the core architecture designed to support several tasks of
common interest, allowing users to naturally conceive, compare and share novel
speech processing pipelines. SpeechBrain achieves competitive or
state-of-the-art performance in a wide range of speech benchmarks. It also
provides training recipes, pretrained models, and inference scripts for popular
speech datasets, as well as tutorials which allow anyone with basic Python
proficiency to familiarize themselves with speech technologies.
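The "training recipes" the abstract mentions follow a common pattern: a Brain-style class defines a forward pass and an objective as separate hooks, and a generic fit loop drives training. A minimal standalone sketch of that pattern in plain Python (the hook names mirror SpeechBrain's documented `compute_forward` / `compute_objectives` / `fit` API, but this toy class and its data are illustrative, not the library itself):

```python
# Standalone sketch of the "Brain" recipe pattern; method names mirror
# SpeechBrain's documented hooks, but the class itself is a hypothetical
# stand-in, not the library's implementation.

class ToyBrain:
    """Minimal stand-in for a speechbrain.core.Brain-style recipe class."""

    def compute_forward(self, batch):
        # Model forward pass; here the "model" simply doubles each input.
        return [2 * x for x in batch["inputs"]]

    def compute_objectives(self, predictions, batch):
        # Loss between predictions and targets (mean absolute error).
        errors = [abs(p - t) for p, t in zip(predictions, batch["targets"])]
        return sum(errors) / len(errors)

    def fit(self, train_set, epochs=1):
        # Generic training loop: forward pass, then loss
        # (optimizer step omitted for brevity).
        losses = []
        for _ in range(epochs):
            for batch in train_set:
                predictions = self.compute_forward(batch)
                losses.append(self.compute_objectives(predictions, batch))
        return losses

data = [{"inputs": [1, 2], "targets": [2, 4]}]
print(ToyBrain().fit(data))  # -> [0.0]
```

Keeping the forward pass and the objective in separate hooks lets a recipe swap models or losses without touching the training loop, which is the kind of flexibility the toolkit advertises.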
Related papers
- Open-Source Conversational AI with SpeechBrain 1.0
SpeechBrain is an open-source Conversational AI toolkit based on PyTorch.
It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them.
arXiv Detail & Related papers (2024-06-29T15:20:11Z)
- SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
Speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.
We propose a novel decoder-only speech language model, SpeechComposer, that can unify common speech tasks by composing a fixed set of prompt tokens.
arXiv Detail & Related papers (2024-01-31T18:06:29Z)
- SALMONN: Towards Generic Hearing Abilities for Large Language Models
We propose SALMONN, a speech audio language music open neural network.
It is built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model.
It is the first model of its type and can be regarded as a step towards AI with generic hearing abilities.
arXiv Detail & Related papers (2023-10-20T05:41:57Z)
- On decoder-only architecture for speech-to-text and large language model integration
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing
Paralinguistic speech processing is important in addressing many issues, such as sentiment and neurocognitive disorder analyses.
We consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing.
SpeechFormer++ is evaluated on the speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer's disease detection (Pitt) tasks.
arXiv Detail & Related papers (2023-02-27T11:48:54Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Building African Voices
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- ESPnet-ST: All-in-One Speech Translation Toolkit
ESPnet-ST is a new project within the end-to-end speech processing toolkit ESPnet.
It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
arXiv Detail & Related papers (2020-04-21T18:38:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.