textless-lib: a Library for Textless Spoken Language Processing
- URL: http://arxiv.org/abs/2202.07359v1
- Date: Tue, 15 Feb 2022 12:39:42 GMT
- Title: textless-lib: a Library for Textless Spoken Language Processing
- Authors: Eugene Kharitonov and Jade Copet and Kushal Lakhotia and Tu Anh Nguyen
and Paden Tomasello and Ann Lee and Ali Elkahky and Wei-Ning Hsu and
Abdelrahman Mohamed and Emmanuel Dupoux and Yossi Adi
- Abstract summary: We introduce textless-lib, a PyTorch-based library aimed at facilitating research in this area.
We describe the building blocks that the library provides and demonstrate its usability.
- Score: 50.070693765984075
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Textless spoken language processing research aims to extend the applicability
of the standard NLP toolset to spoken language and to languages with few or no
textual resources. In this paper, we introduce textless-lib, a PyTorch-based
library aimed at facilitating research in this area. We describe the
building blocks that the library provides and demonstrate its usability by
discussing three different use-case examples: (i) speaker probing, (ii) speech
resynthesis and compression, and (iii) speech continuation. We believe that
textless-lib substantially simplifies research in the textless setting and will be
useful not only for speech researchers but also for the NLP community at
large. The code, documentation, and pre-trained models are available at
https://github.com/facebookresearch/textlesslib/ .
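The library's central abstraction is a speech encoder that maps a raw waveform to a sequence of discrete pseudo-text units. A minimal sketch, adapted from the quickstart in the repository's README (model and argument names follow that example and may differ across library versions):

```python
# Encode a waveform into discrete units with textless-lib.
# Adapted from the repository's README quickstart; pretrained checkpoints
# are downloaded automatically the first time the encoder is built.
import torchaudio
from textless.data.speech_encoder import SpeechEncoder

# HuBERT dense features quantized with a 100-unit k-means codebook.
encoder = SpeechEncoder.by_name(
    dense_model_name="hubert-base-ls960",
    quantizer_model_name="kmeans",
    vocab_size=100,
    deduplicate=True,  # collapse runs of identical units, as in GSLM
)

waveform, sample_rate = torchaudio.load("input.wav")  # 16 kHz mono audio
encoded = encoder(waveform)
units = encoded["units"]  # 1-D tensor of pseudo-text unit ids
print(units[:20])
```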
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach based on low-rank adaptation (LoRA) of a large language model (LLM) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
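LoRA itself is the standard low-rank adapter recipe; a hedged sketch using Hugging Face PEFT (the model name and target modules below are illustrative assumptions, not the paper's configuration):

```python
# Attach LoRA adapters to a frozen LLM backbone for supervised fine-tuning.
# Model name and target modules are illustrative, not the paper's setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

backbone = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
lora_cfg = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # only the adapters are trainable
```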
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, Arabic still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
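One plausible reading of "pre-defined dialect labels" is an extra conditioning signal added alongside the speaker embedding; the sketch below is purely illustrative and not the paper's architecture:

```python
# Hypothetical dialect conditioning for a multi-dialect ZS-TTS model:
# a learned dialect embedding is added to the speaker embedding and
# broadcast over the text-encoder states. Illustrative only.
import torch
import torch.nn as nn

class DialectConditioning(nn.Module):
    def __init__(self, num_dialects: int, d_model: int):
        super().__init__()
        self.dialect_emb = nn.Embedding(num_dialects, d_model)

    def forward(self, encoder_states, speaker_emb, dialect_id):
        cond = speaker_emb + self.dialect_emb(dialect_id)  # (B, d_model)
        return encoder_states + cond.unsqueeze(1)          # (B, T, d_model)

cond = DialectConditioning(num_dialects=5, d_model=256)
states = torch.randn(2, 40, 256)   # text-encoder outputs
speaker = torch.randn(2, 256)      # reference-speaker embedding
print(cond(states, speaker, torch.tensor([0, 3])).shape)  # (2, 40, 256)
```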
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio [0.5097809301149341]
We find that most language models generate compelling text even under significant constraints.
We present a technique for modifying the output of a language model by compositionally applying filter functions to the language model's vocabulary.
We also present Gadsby, a Hugging Face Space web app demonstrating this technique.
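The filtering idea maps naturally onto logits processing at decoding time; a sketch using a custom Hugging Face LogitsProcessor rather than the paper's own tool (the lipogram predicate is just an example constraint):

```python
# Constrained decoding by composing vocabulary filters: a token survives
# only if every predicate accepts its surface string. Implemented as a
# Hugging Face LogitsProcessor; not the paper's own code.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class VocabFilter(LogitsProcessor):
    def __init__(self, tokenizer, predicates):
        # Pre-compute the token ids banned by at least one predicate.
        surface = [tokenizer.decode([i]) for i in range(len(tokenizer))]
        self.banned = torch.tensor(
            [not all(p(tok) for p in predicates) for tok in surface])

    def __call__(self, input_ids, scores):
        scores[:, self.banned] = float("-inf")
        return scores

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
no_e = lambda t: "e" not in t.lower()  # lipogram constraint, as in "Gadsby"
processors = LogitsProcessorList([VocabFilter(tok, [no_e])])
out = model.generate(**tok("A story:", return_tensors="pt"),
                     max_new_tokens=30, logits_processor=processors)
print(tok.decode(out[0]))
```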
arXiv Detail & Related papers (2023-06-28T05:10:51Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data from the target speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
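A transformer-based mel decoder of this general shape predicts frames autoregressively, conditioned on encoded text and surrounding-audio context; the sketch below is illustrative only, not the paper's model:

```python
# Illustrative autoregressive transformer decoder over mel frames,
# conditioned on a "memory" of encoded text and audio context.
# Layer sizes are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    def __init__(self, d_model=256, n_mels=80, nhead=4, num_layers=3):
        super().__init__()
        self.prenet = nn.Linear(n_mels, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.proj = nn.Linear(d_model, n_mels)

    def forward(self, prev_mels, memory):
        x = self.prenet(prev_mels)  # (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.proj(self.decoder(x, memory, tgt_mask=mask))

dec = MelDecoder()
prev = torch.zeros(2, 10, 80)     # previously generated mel frames
memory = torch.randn(2, 20, 256)  # encoded text + context states
print(dec(prev, memory).shape)    # torch.Size([2, 10, 80])
```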
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- NeMo Toolbox for Speech Dataset Construction [11.494290433050624]
We develop tools for each step of the speech dataset construction pipeline including data preprocessing, audio-text alignment, data post-processing and filtering.
We demonstrate the toolbox's efficiency by building the Russian LibriSpeech corpus (RuLS) from LibriVox audiobooks.
arXiv Detail & Related papers (2021-04-11T01:57:55Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integrated approach to speech-to-text translation.
The key idea is to generate the source transcript and the target translation with a single decoder.
Our method is verified on three mainstream datasets.
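The single-decoder idea can be illustrated by how the target sequence is laid out: transcript first, then translation, joined by a separator. A minimal sketch (the separator token name is an assumption):

```python
# Consecutive decoding sketch: one autoregressive decoder emits the
# source transcript, a separator, then the target translation.
# The <sep> token name is an illustrative assumption.
SEP = "<sep>"

def make_target(transcript, translation):
    # Training target: transcript tokens + <sep> + translation tokens.
    return transcript + [SEP] + translation

def split_output(decoded):
    # At inference, split the decoder output at the separator.
    i = decoded.index(SEP)
    return decoded[:i], decoded[i + 1:]

tgt = make_target(["thank", "you"], ["merci"])
print(split_output(tgt))  # (['thank', 'you'], ['merci'])
```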
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of available parallel data.
We build a system that fully utilizes the signals in a parallel ST corpus.
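A parallel ST corpus pairs audio with both a source transcript and a target translation, so three supervision signals are available: ASR ("listen"), MT ("understand"), and ST ("translate"). A hedged sketch of combining them into one objective (the weights are illustrative assumptions):

```python
# Combine the three supervision signals from a parallel ST corpus.
# Loss weights are illustrative assumptions, not the paper's values.
import torch

def triple_supervision_loss(asr_loss, mt_loss, st_loss,
                            w_asr=0.3, w_mt=0.3):
    # ST is the main objective; the ASR and MT terms supervise the
    # shared "listen" and "understand" components.
    return st_loss + w_asr * asr_loss + w_mt * mt_loss

loss = triple_supervision_loss(torch.tensor(2.1),
                               torch.tensor(1.7),
                               torch.tensor(3.0))
print(loss)  # tensor(4.1400)
```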
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
- Contextualized Spoken Word Representations from Convolutional Autoencoders [2.28438857884398]
This paper proposes a convolutional-autoencoder-based neural architecture to model syntactically and semantically adequate contextualized representations of variable-length spoken words.
The proposed model demonstrates its robustness when compared to two other language-based models.
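A minimal sketch of the general idea, assuming 1-D convolutions over mel-feature frames with the bottleneck code as the word representation (layer sizes are assumptions, not the paper's architecture):

```python
# 1-D convolutional autoencoder over speech features; the bottleneck
# code serves as the spoken-word representation. Illustrative sizes.
import torch
import torch.nn as nn

class ConvWordAutoencoder(nn.Module):
    def __init__(self, n_feats=80, hidden=128, bottleneck=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_feats, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, bottleneck, kernel_size=5, stride=2, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(bottleneck, hidden, kernel_size=5, stride=2,
                               padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(hidden, n_feats, kernel_size=5, stride=2,
                               padding=2, output_padding=1),
        )

    def forward(self, x):                 # x: (B, n_feats, T)
        z = self.encoder(x)               # bottleneck code
        return self.decoder(z), z

x = torch.randn(2, 80, 100)              # padded batch of spoken words
recon, code = ConvWordAutoencoder()(x)
print(recon.shape, code.shape)           # (2, 80, 100) and (2, 64, 25)
```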
arXiv Detail & Related papers (2020-07-06T16:48:11Z)