OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning
- URL: http://arxiv.org/abs/2506.00338v1
- Date: Sat, 31 May 2025 01:44:44 GMT
- Title: OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning
- Authors: Yifan Peng, Shakeel Muhammad, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe,
- Abstract summary: The Open Whisper-style Speech Models (OWSM) project has developed a series of fully open speech foundation models. This work enhances OWSM by integrating YODAS, a large-scale web-crawled dataset with a Creative Commons license. To address YODAS's incorrect language labels and audio-text misalignments, we develop a scalable data-cleaning pipeline using public toolkits, yielding a dataset with 166,000 hours of speech across 75 languages.
- Score: 41.50536035290623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Open Whisper-style Speech Models (OWSM) project has developed a series of fully open speech foundation models using academic-scale resources, but their training data remains insufficient. This work enhances OWSM by integrating YODAS, a large-scale web-crawled dataset with a Creative Commons license. However, incorporating YODAS is nontrivial due to its wild nature, which introduces challenges such as incorrect language labels and audio-text misalignments. To address this, we develop a scalable data-cleaning pipeline using public toolkits, yielding a dataset with 166,000 hours of speech across 75 languages. Our new series of OWSM v4 models, trained on this curated dataset alongside existing OWSM data, significantly outperform previous versions on multilingual benchmarks. Our models even match or surpass frontier industrial models like Whisper and MMS in multiple scenarios. We will publicly release the cleaned YODAS data, pre-trained models, and all associated scripts via the ESPnet toolkit.
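The abstract describes filtering web-crawled speech by detecting incorrect language labels and audio-text misalignments. The sketch below is purely illustrative: the actual OWSM v4 pipeline, its toolkits, and its thresholds are not specified here, so the language-ID check, the CTC-style alignment score, and the cutoff value are all assumptions.

```python
# Illustrative sketch of a YODAS-style cleaning pass (hypothetical; the real
# OWSM v4 pipeline and thresholds may differ). Each utterance is assumed to
# carry a precomputed language-ID prediction and an alignment score.
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_id: str
    claimed_lang: str    # language label from the web crawl
    detected_lang: str   # output of a language-ID model (assumed precomputed)
    align_score: float   # audio-text alignment log-prob (higher = better)

def keep(utt: Utterance, min_align_score: float = -1.5) -> bool:
    """Keep an utterance only if the crawled language label agrees with the
    LID prediction and the alignment score clears a threshold."""
    return (utt.claimed_lang == utt.detected_lang
            and utt.align_score >= min_align_score)

corpus = [
    Utterance("a", "en", "en", -0.4),  # clean -> keep
    Utterance("b", "en", "de", -0.3),  # wrong language label -> drop
    Utterance("c", "fr", "fr", -3.2),  # misaligned transcript -> drop
]
cleaned = [u for u in corpus if keep(u)]
print([u.audio_id for u in cleaned])  # -> ['a']
```

In practice such filters would run over millions of segments, with the threshold tuned per language; the point is simply that both failure modes named in the abstract (label errors and misalignments) can be screened with independent checks.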
Related papers
- On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models [57.97940182536942]
The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models.
OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways.
We introduce OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity.
arXiv Detail & Related papers (2024-06-13T16:22:37Z)
- Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data [75.7383558074758]
This work presents an Open Whisper-style Speech Model (OWSM)
OWSM reproduces Whisper-style training using an open-source toolkit and publicly available data.
We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.
arXiv Detail & Related papers (2023-09-25T05:01:34Z)
- LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset [75.9621305227523]
We introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art large language models (LLMs).
This dataset is collected from 210K IP addresses in the wild on our Vicuna demo and Arena website.
We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.
arXiv Detail & Related papers (2023-09-21T12:13:55Z)
- Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm-start from pretrained textual language models.
We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.