JOIST: A Joint Speech and Text Streaming Model For ASR
- URL: http://arxiv.org/abs/2210.07353v1
- Date: Thu, 13 Oct 2022 20:59:22 GMT
- Title: JOIST: A Joint Speech and Text Streaming Model For ASR
- Authors: Tara N. Sainath, Rohit Prabhavalkar, Ankur Bapna, Yu Zhang, Zhouyuan
Huo, Zhehuai Chen, Bo Li, Weiran Wang and Trevor Strohman
- Abstract summary: We present JOIST, an algorithm to train a streaming, cascaded-encoder end-to-end (E2E) model with both paired speech-text inputs and unpaired text-only inputs.
We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text.
- Score: 63.15848310748753
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present JOIST, an algorithm to train a streaming, cascaded-encoder
end-to-end (E2E) model with both paired speech-text inputs and unpaired
text-only inputs. Unlike previous works, we explore joint training with both
modalities, rather than pre-training and fine-tuning. In addition, we explore
JOIST using a streaming E2E model trained with an order of magnitude more
data, both of which are novel compared to previous works. Through a series of
ablation studies, we explore different types of text modeling, including how
to model the length of the text sequence and the appropriate text sub-word
unit representation. We find that the best text representation for JOIST
improves WER across a variety of search and rare-word test sets by 4-14%
relative, compared to a model not trained with text. In addition, we
quantitatively show that JOIST maintains streaming capabilities, which is
important for a good user experience.
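To make the joint-training idea concrete, here is a minimal Python sketch. It interleaves paired speech-text batches with unpaired text batches in a single loop, and it upsamples each text sub-word unit by a fixed factor as a stand-in for the paper's text-length modeling. The function names, the repetition factor, and the toy data are assumptions for illustration, not the authors' implementation.

```python
from typing import List

REPEAT = 2  # assumed per-token upsampling factor; the paper ablates length modeling

def upsample_text(subwords: List[str], repeat: int = REPEAT) -> List[str]:
    """Repeat each sub-word unit so text roughly matches the length of speech."""
    return [tok for tok in subwords for _ in range(repeat)]

def train_step(inputs, paired: bool) -> None:
    # Stand-in for one update of the streaming cascaded-encoder model.
    kind = "paired speech+text" if paired else "text-only (upsampled)"
    print(f"{kind} batch -> {inputs}")

paired_batches = [(["frame0", "frame1", "frame2", "frame3"], ["jo", "ist"])]
text_batches = [["stream", "ing", "asr"]]

# Joint training: both modalities are mixed during training,
# rather than pre-training on one and fine-tuning on the other.
for speech, text in paired_batches:
    train_step((speech, text), paired=True)
for text in text_batches:
    train_step(upsample_text(text), paired=False)
```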
Related papers
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
COSA is built by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
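As a rough illustration of the concatenation step (the helper below is hypothetical, not COSA's actual pipeline), a few image-text pairs can be merged so the images act as pseudo video frames and their captions as a paragraph:

```python
from typing import List, Tuple

def concatenate_samples(pairs: List[Tuple[str, str]]) -> Tuple[List[str], str]:
    """Treat k images as k pseudo video frames and join their captions."""
    frames = [image for image, _ in pairs]
    paragraph = " ".join(caption for _, caption in pairs)
    return frames, paragraph

pairs = [("img_0.jpg", "A dog runs."), ("img_1.jpg", "It catches a ball.")]
print(concatenate_samples(pairs))
# (['img_0.jpg', 'img_1.jpg'], 'A dog runs. It catches a ball.')
```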
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- TextMatcher: Cross-Attentional Neural Network to Compare Image and Text [0.0]
We devise TextMatcher, the first machine-learning model specifically designed for this image-text comparison problem.
We extensively evaluate the empirical performance of TextMatcher on the popular IAM dataset.
We showcase TextMatcher in a real-world application scenario concerning the automatic processing of bank cheques.
arXiv Detail & Related papers (2022-05-11T14:01:12Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
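A minimal sketch of the joint objective, assuming a simple weighted sum of the two pre-training losses (the placeholder loss values and the mixing weight are illustrative, not SLAM's actual formulation):

```python
def bert_mlm_loss(text_batch) -> float:
    return 0.7  # placeholder for the BERT masked-language-model loss on text

def w2v_bert_loss(speech_batch) -> float:
    return 1.3  # placeholder for the w2v-BERT loss on speech

def joint_loss(text_batch, speech_batch, alpha: float = 0.5) -> float:
    # One shared encoder would process both batches; only the combined
    # objective is shown here (the weight alpha is an assumption).
    return alpha * bert_mlm_loss(text_batch) + (1 - alpha) * w2v_bert_loss(speech_batch)

print(joint_loss(["masked text"], ["raw audio"]))  # 1.0
```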
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
- Syntax-Enhanced Pre-trained Model [49.1659635460369]
We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa.
Existing methods utilize the syntax of text either in the pre-training stage or in the fine-tuning stage, and thus suffer from a discrepancy between the two stages.
We present a model that utilizes the syntax of text in both pre-training and fine-tuning stages.
arXiv Detail & Related papers (2020-12-28T06:48:04Z)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching [28.190001111358438]
We propose the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching.
Our model contains several innovations to adapt self-attention models for longer text input.
We will open source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
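The hierarchical idea can be sketched as two levels of encoding (the structure below is an assumption about the approach, not the released model): a first encoder summarizes each fixed-size token block, and a second encoder operates on the sequence of block summaries.

```python
from typing import List

BLOCK = 4  # toy block size; the actual model uses much longer sentence blocks

def block_encoder(tokens: List[str]) -> float:
    # Stand-in for a Transformer over one block; returns a toy "embedding".
    return float(len(tokens))

def document_encoder(tokens: List[str]) -> List[float]:
    blocks = [tokens[i:i + BLOCK] for i in range(0, len(tokens), BLOCK)]
    # Stand-in for the second-level Transformer over block embeddings.
    return [block_encoder(block) for block in blocks]

doc = "long form document matching needs hierarchy beyond 512 tokens".split()
print(document_encoder(doc))  # [4.0, 4.0, 1.0]
```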
arXiv Detail & Related papers (2020-04-26T07:04:08Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset, in the same writing style as the reference.
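To make the task concrete, here is a toy input/output instance (the records, reference text, and expected output are invented for illustration and are not from the paper's data):

```python
# Input aspect 1: structured source records whose content should be described.
source_records = [
    {"player": "Lee", "points": 32, "rebounds": 7},
    {"player": "Kim", "points": 18, "rebounds": 11},
]
# Input aspect 2: a reference text that describes a *different* recordset,
# supplying only the writing style.
reference_text = "Park poured in 25 points, while Choi grabbed 9 boards."
# Desired output: selected facts from source_records, rendered in the
# reference's style.
expected_output = "Lee poured in 32 points, while Kim grabbed 11 boards."
print(expected_output)
```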
arXiv Detail & Related papers (2020-02-24T12:52:10Z)