On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
- URL: http://arxiv.org/abs/2406.09282v1
- Date: Thu, 13 Jun 2024 16:22:37 GMT
- Title: On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
- Authors: Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe
- Abstract summary: The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models.
OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways.
We introduce OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity.
- Score: 57.97940182536942
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with a proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations kept the same, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.
Related papers
- Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation [43.479279052047985]
We conduct a preliminary study to understand the importance of modeling other information using separate learnable parameters.
Our findings are twofold: first, the O-HuBERT method is able to utilize all layers to build complex features to encode other information; second, a robust data augmentation strategy is essential for learning the information required by tasks that depend on other information.
arXiv Detail & Related papers (2024-08-20T05:45:04Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer [67.75820725013372]
The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits.
We present a series of E-Branchformer-based models named OWSM v3.1, ranging from 100M to 1B parameters.
OWSM v3.1 outperforms its predecessor, OWSM v3, in most evaluation benchmarks, while showing an improved inference speed of up to 25%.
arXiv Detail & Related papers (2024-01-30T01:22:18Z)
- Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation [6.273933281069326]
We investigate three text diversity incentive methods well established in crowdsourcing: taboo words, hints by previous outlier solutions, and chaining on previous outlier solutions.
We show that diversity is most increased by taboo words, but downstream model performance is highest with hints.
arXiv Detail & Related papers (2024-01-12T15:46:43Z)
- Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace [21.015261553612643]
We present a dataset with over 40k instances across ten abilities and examine instruction-tuned models with 7b to 33b parameters.
Our study reveals three primary findings: (i) despite the models' overall performance being tied to data and parameter scale, individual abilities have different sensitivities to these factors;
(ii) human-curated data strongly outperforms synthetic data from GPT-4 in efficiency and can consistently enhance model performance as volume increases.
arXiv Detail & Related papers (2023-10-30T15:37:10Z)
- Pre-trained Language Models for Keyphrase Generation: A Thorough Empirical Study [76.52997424694767]
We present an in-depth empirical study of keyphrase extraction and keyphrase generation using pre-trained language models.
We show that PLMs have competitive high-resource performance and state-of-the-art low-resource performance.
Further results show that in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models.
arXiv Detail & Related papers (2022-12-20T13:20:21Z)
- Exploring the State-of-the-Art Language Modeling Methods and Data Augmentation Techniques for Multilingual Clause-Level Morphology [3.8498574327875947]
We present our work on all three parts of the shared task: inflection, reinflection, and analysis.
We mainly explore two approaches: Transformer models in combination with data augmentation, and exploiting the state-of-the-art language modeling techniques for morphological analysis.
Our methods achieved first place in each of the three tasks, outperforming the mT5 baseline with 89% for inflection, 80% for reinflection, and 12% for analysis.
arXiv Detail & Related papers (2022-11-03T11:53:39Z)
- An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation [91.62129090006745]
This paper studies the distribution shift problem from the perspective of pre-training and data augmentation.
We provide the first comprehensive empirical study focusing on pre-training and data augmentation.
arXiv Detail & Related papers (2022-05-25T13:04:53Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.