ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks
- URL: http://arxiv.org/abs/2207.01893v1
- Date: Tue, 5 Jul 2022 08:47:51 GMT
- Title: ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks
- Authors: Valentin Pelloin, Franck Dary, Nicolas Hervé, Benoît Favre, Nathalie Camelin, Antoine Laurent, Laurent Besacier
- Abstract summary: We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR on 350,000 hours of diverse TV shows.
New models (FlauBERT-Oral) are shared with the community and evaluated for 3 downstream tasks: spoken language understanding, classification of TV shows and speech syntactic parsing.
- Score: 20.83731188652985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We aim to improve spoken language modeling (LM) using a very large
amount of automatically transcribed speech. We leverage the INA (French
National Audiovisual Institute) collection and obtain 19GB of text after
applying ASR on 350,000 hours of diverse TV shows. From this, spoken language
models are trained either by fine-tuning an existing LM (FlauBERT) or by
training an LM from scratch. The new models (FlauBERT-Oral) are shared with
the community and evaluated on three downstream tasks: spoken language
understanding, classification of TV shows, and syntactic parsing of speech.
Results show that FlauBERT-Oral can bring gains over the original FlauBERT,
demonstrating that, despite its inherently noisy nature, ASR-generated text
can be used to build spoken language models.
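As a rough illustration of the fine-tuning route, the sketch below continues
masked-LM pre-training of the public FlauBERT checkpoint on a plain-text file
of ASR transcripts with Hugging Face transformers; the file name
asr_transcripts.txt and all hyperparameters are illustrative assumptions, not
the paper's exact setup.

```python
# Minimal sketch: continue masked-LM pre-training of FlauBERT on ASR output.
# "asr_transcripts.txt" (one transcript per line) and every hyperparameter
# here are illustrative assumptions, not the paper's configuration.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    FlaubertTokenizer,
    FlaubertWithLMHeadModel,
    Trainer,
    TrainingArguments,
)

tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_base_cased")
model = FlaubertWithLMHeadModel.from_pretrained("flaubert/flaubert_base_cased")

dataset = load_dataset("text", data_files={"train": "asr_transcripts.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="flaubert-oral",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=dataset,
    # 15% of tokens are masked, as in standard BERT-style pre-training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15),
)
trainer.train()
```

The from-scratch variant would be identical except that the model is
initialized from a fresh FlaubertConfig instead of the pre-trained weights.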
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating paired speech-text data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Speech Recognition Rescoring with Large Speech-Text Foundation Models [20.145389016219106]
Large language models (LLMs) have demonstrated the ability to understand human language by leveraging large amounts of text data.
Automatic speech recognition (ASR) systems are often limited by available transcribed speech data.
Recent multi-modal large language models have demonstrated strong spoken language understanding.
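For context, second-pass rescoring is usually implemented by interpolating the
first-pass ASR score with an external LM score over the N-best list. The
sketch below is a generic, text-only illustration, not this paper's
speech-text foundation model; the gpt2 checkpoint, the toy 3-best list, and
the interpolation weight are assumptions.

```python
# Generic sketch of N-best rescoring with an external causal LM; the model,
# hypotheses, and interpolation weight below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def lm_log_prob(text: str) -> float:
    """Summed log-probability of `text` under the language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        mean_nll = model(ids, labels=ids).loss  # mean NLL over predicted tokens
    return -mean_nll.item() * (ids.shape[1] - 1)

# Hypothetical 3-best list with first-pass ASR scores (log domain).
nbest = [
    ("the cat sat on the mat", -11.9),
    ("the cat sad on the mat", -11.2),
    ("a cat sat on a mat", -12.4),
]

lam = 0.5  # interpolation weight, normally tuned on a dev set
best = max(nbest, key=lambda h: (1 - lam) * h[1] + lam * lm_log_prob(h[0]))
print(best[0])
```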
arXiv Detail & Related papers (2024-09-25T06:17:23Z)
- SLM: Bridge the thin gap between speech and text foundation models [45.319071954143325]
Speech and Language Model (SLM) is a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models.
We show that SLM is not only efficient to train but also inherits strong capabilities already acquired by foundation models of different modalities.
arXiv Detail & Related papers (2023-09-30T02:27:45Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models achieve higher performance than baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech.
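The key ingredient is codec tokenization: a neural codec maps audio to a grid
of discrete codes that an ordinary token LM can model. The hedged sketch below
shows only this tokenization step, using the facebook/encodec_24khz checkpoint
from transformers on placeholder audio; it is not VALL-E's training code.

```python
# Sketch of the codec-tokenization step behind VALL-E-style TTS: a neural
# codec converts audio into discrete codes that a token LM can model.
# Uses random placeholder audio; real training conditions on a text prompt.
import numpy as np
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

audio = np.random.randn(24000).astype(np.float32)  # 1 s of placeholder audio
inputs = processor(raw_audio=audio, sampling_rate=24000, return_tensors="pt")

encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
codes = encoded.audio_codes  # integer codes, one sequence per codebook
print(codes.shape)

# VALL-E then trains an autoregressive LM over the first codebook's codes
# (conditioned on text) and a non-autoregressive model for the remaining ones.
```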
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Text-Free Prosody-Aware Generative Spoken Language Modeling [46.19240899818964]
We present a prosody-aware generative spoken language model (pGSLM).
It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms.
Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
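To make the multi-stream idea concrete, here is a toy PyTorch sketch in the
spirit of the MS-TLM, not the released pGSLM implementation: each time step
embeds a discrete unit together with quantized duration and F0 tokens, sums
the embeddings, and a causal transformer predicts all three streams.

```python
# Toy multi-stream transformer LM in the spirit of pGSLM's MS-TLM; vocabulary
# sizes and quantization are placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class MultiStreamTLM(nn.Module):
    def __init__(self, n_units=100, n_dur=32, n_f0=32, d=256, layers=4):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, d)  # discovered speech units
        self.dur_emb = nn.Embedding(n_dur, d)     # quantized duration stream
        self.f0_emb = nn.Embedding(n_f0, d)       # quantized log-F0 stream
        block = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.heads = nn.ModuleList(
            [nn.Linear(d, n) for n in (n_units, n_dur, n_f0)]  # one per stream
        )

    def forward(self, units, durs, f0s):
        # Sum the per-stream embeddings into one input sequence.
        x = self.unit_emb(units) + self.dur_emb(durs) + self.f0_emb(f0s)
        t = units.shape[1]
        # Additive causal mask so each step sees only its past.
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=causal)
        return [head(h) for head in self.heads]  # logits per output stream

model = MultiStreamTLM()
units, durs, f0s = (torch.randint(0, 32, (2, 50)) for _ in range(3))
unit_logits, dur_logits, f0_logits = model(units, durs, f0s)
```

An adapted HiFi-GAN vocoder (not shown) would then convert the predicted unit
and prosody streams back to a waveform.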
arXiv Detail & Related papers (2021-09-07T18:03:21Z)
- How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well when trained on IPA transcriptions of the languages seen during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate acoustic model (AM) and language model (LM).
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a phonotactic model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)