Reproducing Whisper-Style Training Using an Open-Source Toolkit and
Publicly Available Data
- URL: http://arxiv.org/abs/2309.13876v3
- Date: Tue, 24 Oct 2023 18:28:22 GMT
- Title: Reproducing Whisper-Style Training Using an Open-Source Toolkit and
Publicly Available Data
- Authors: Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang,
Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou
Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji
Watanabe
- Abstract summary: This work presents an Open Whisper-style Speech Model (OWSM).
OWSM reproduces Whisper-style training using an open-source toolkit and publicly available data.
We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.
- Score: 75.7383558074758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training speech models on large volumes of data has achieved remarkable
success. OpenAI Whisper is a multilingual multitask model trained on 680k hours
of supervised speech data. It generalizes well to various speech recognition
and translation benchmarks even in a zero-shot setup. However, the full
pipeline for developing such models (from data collection to training) is not
publicly accessible, which makes it difficult for researchers to further
improve its performance and address training-related issues such as efficiency,
robustness, fairness, and bias. This work presents an Open Whisper-style Speech
Model (OWSM), which reproduces Whisper-style training using an open-source
toolkit and publicly available data. OWSM even supports more translation
directions and can be more efficient to train. We will publicly release all
scripts used for data preparation, training, inference, and scoring as well as
pre-trained models and training logs to promote open science.
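For orientation, the sketch below illustrates the kind of zero-shot multitask inference the abstract describes (speech recognition and X-to-English translation from a single model), using the open-source `whisper` package. It is a minimal illustration under stated assumptions, not the OWSM recipe itself: OWSM's own scripts and checkpoints are released through ESPnet, and the audio file name and model size used here are assumptions chosen for the example.

```python
# Minimal sketch of zero-shot multitask inference with the open-source
# `whisper` package (pip install openai-whisper). This illustrates
# Whisper-style usage, not the OWSM/ESPnet recipe; the file name
# "speech.wav" and the "medium" model size are illustrative assumptions.
import whisper

model = whisper.load_model("medium")  # multilingual checkpoint

# Zero-shot speech recognition: the source language is detected automatically.
asr = model.transcribe("speech.wav", task="transcribe")
print("Detected language:", asr["language"])
print("Transcript:", asr["text"])

# Zero-shot speech translation: the same model renders the audio in English.
st = model.transcribe("speech.wav", task="translate")
print("English translation:", st["text"])
```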
Related papers
- Less is More: Accurate Speech Recognition & Translation without Web-Scale Data [26.461185681285745]
Canary is a multilingual ASR and speech translation model.
It outperforms Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German.
arXiv Detail & Related papers (2024-06-28T06:22:23Z)
- Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm start from pretrained textual language models.
We show, using both automatic and human evaluations, that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z)
- Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large amounts of untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones.
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
- Efficient Speech Translation with Pre-trained Models [13.107314023500349]
We investigate efficient strategies for building cascaded and end-to-end speech translation systems based on pre-trained models.
While end-to-end models show superior translation performance to cascaded ones, their application is limited by the need for additional end-to-end training data.
arXiv Detail & Related papers (2022-11-09T15:07:06Z)
- Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is just as important as using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
arXiv Detail & Related papers (2022-02-26T07:21:00Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data and optimize the training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT [69.26447267827454]
Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training.
This paper introduces DistilHuBERT, a novel multi-task learning framework that distills hidden representations directly from a HuBERT model.
arXiv Detail & Related papers (2021-10-05T09:34:44Z)
- Training Data Leakage Analysis in Language Models [6.843491191969066]
We introduce a methodology for identifying user content in the training data that could be leaked under a strong and realistic threat model.
We propose two metrics to quantify user-level data leakage by measuring a model's ability to produce unique sentence fragments from the training data.
arXiv Detail & Related papers (2021-01-14T00:57:32Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true many-to-many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open-source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT (see the usage sketch after this list).
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
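As referenced in the Beyond English-Centric Multilingual Machine Translation entry above, the short sketch below shows direct non-English-to-non-English translation with the publicly released many-to-many model (M2M-100) through Hugging Face Transformers. The checkpoint name, language pair, and example sentence are assumptions chosen for illustration, not details from that paper's abstract.

```python
# Hedged sketch: direct French-to-German translation with the open-sourced
# many-to-many model (M2M-100) via Hugging Face Transformers. The checkpoint
# name, language pair, and input sentence are illustrative assumptions.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "fr"  # source language: French
inputs = tokenizer("La vie est belle.", return_tensors="pt")

# Forcing the decoder to start with the German language token selects the
# target direction, without pivoting through English.
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.get_lang_id("de")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```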