T5lephone: Bridging Speech and Text Self-supervised Models for Spoken
Language Understanding via Phoneme level T5
- URL: http://arxiv.org/abs/2211.00586v1
- Date: Tue, 1 Nov 2022 17:00:23 GMT
- Title: T5lephone: Bridging Speech and Text Self-supervised Models for Spoken
Language Understanding via Phoneme level T5
- Authors: Chan-Jan Hsu, Ho-Lam Chung, Hung-yi Lee, Yu Tsao
- Abstract summary: We conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding tasks.
We extend the idea to create T5lephone, a variant of T5 that is pretrained using phonemicized text.
- Score: 65.32642587901903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In spoken language understanding (SLU), a natural solution is to
concatenate pre-trained speech models (e.g. HuBERT) and pretrained language
models (PLMs, e.g. T5). Most previous works use pretrained language models with
subword-based tokenization. However, the granularity of input units affects the
alignment between speech model outputs and language model inputs, and PLMs with
character-based tokenization are underexplored. In this work, we conduct
extensive studies on how PLMs with different tokenization strategies affect
spoken language understanding tasks, including spoken question answering (SQA)
and speech translation (ST). We further extend the idea to create T5lephone
(pronounced as telephone), a variant of T5 that is pretrained using
phonemicized text. We initialize T5lephone with existing PLMs so that it can be
pretrained with relatively lightweight computational resources. We reach
state-of-the-art results on NMSQA, and the T5lephone model exceeds T5 with
other types of units on end-to-end SQA and ST.
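
As a rough illustration of the concatenation setup described in the abstract (not the authors' exact architecture), the sketch below feeds HuBERT frame representations into a T5 encoder-decoder through a linear connector; the Hugging Face checkpoint names, the connector, and the dummy inputs are assumptions, and T5lephone would simply swap in a PLM pretrained on phonemicized text.

```python
# Minimal sketch, assuming Hugging Face Transformers and the checkpoints named
# below; the linear connector and the dummy inputs are illustrative, not the
# paper's exact setup.
import torch
from transformers import AutoTokenizer, HubertModel, T5ForConditionalGeneration

speech_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
plm = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Project HuBERT hidden states onto T5's embedding dimension (d_model).
connector = torch.nn.Linear(speech_encoder.config.hidden_size, plm.config.d_model)

waveform = torch.randn(1, 16000)  # 1 second of dummy 16 kHz audio
with torch.no_grad():             # keep the speech encoder frozen in this sketch
    frames = speech_encoder(waveform).last_hidden_state  # (1, T, 768), ~50 frames/s

inputs_embeds = connector(frames)                        # align with T5's input space
labels = tokenizer("answer span", return_tensors="pt").input_ids

# Seq2seq loss for an SLU task such as SQA or ST: speech in, text out.
loss = plm(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()  # gradients reach the connector and T5; HuBERT stays frozen here
```

The granularity question raised in the abstract shows up here in the mismatch between the roughly 50 Hz speech frames and the text targets: subword tokens each cover several characters, whereas character- or phoneme-level units (as in T5lephone) yield target sequences whose granularity is closer to the frame rate, which is the alignment consideration the abstract highlights.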
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5 Hz and 60 bps and achieves SotA in syllabic segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained model with an encoder-decoder architecture for offensive language identification, based on text-to-text transformers (T5).
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
arXiv Detail & Related papers (2023-12-06T09:37:27Z)
- mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations [54.42422445568523]
mmT5 is a modular multilingual sequence-to-sequence model.
It disentangles language-specific information from language-agnostic information.
Compared to mT5, mmT5 raises the rate of generating text in the correct language under zero-shot settings from 7% to 99%.
arXiv Detail & Related papers (2023-05-23T16:38:01Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- Sequence to sequence pretraining for a less-resourced Slovenian language [0.0]
We train two differently sized T5-type sequence-to-sequence models for the morphologically rich Slovene language with far fewer resources and analyze their behavior.
On classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model, but they are worth considering for generative tasks.
arXiv Detail & Related papers (2022-07-28T10:08:50Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing [77.4527868307914]
We propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets.
To align the textual and speech information into a unified semantic space, we propose a cross-modal vector quantization method with random mixing-up to bridge speech and text.
arXiv Detail & Related papers (2021-10-14T07:59:27Z)
- mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.