Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular
Subword Units
- URL: http://arxiv.org/abs/2110.04109v1
- Date: Fri, 8 Oct 2021 13:15:58 GMT
- Title: Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular
Subword Units
- Authors: Yosuke Higuchi, Keita Karube, Tetsuji Ogawa, Tetsunori Kobayashi
- Abstract summary: In end-to-end automatic speech recognition, a model is expected to implicitly learn representations suitable for recognizing a word-level sequence.
We propose a hierarchical conditional model based on connectionist temporal classification (CTC).
Experimental results on LibriSpeech-{100h, 960h} and TEDLIUM2 demonstrate that the proposed model improves over a standard CTC-based model.
- Score: 19.668440671541546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In end-to-end automatic speech recognition (ASR), a model is expected to
implicitly learn representations suitable for recognizing a word-level
sequence. However, the huge abstraction gap between input acoustic signals and
output linguistic tokens makes it challenging for a model to learn the
representations. In this work, to promote the word-level representation
learning in end-to-end ASR, we propose a hierarchical conditional model that is
based on connectionist temporal classification (CTC). Our model is trained by
auxiliary CTC losses applied to intermediate layers, where the vocabulary size
of each target subword sequence is gradually increased as the layers approach
the word-level output. Here, we make each level of sequence prediction
explicitly conditioned on the previous sequences predicted at lower levels.
With the proposed approach, we expect the model to learn
word-level representations effectively by exploiting a hierarchy of linguistic
structures. Experimental results on LibriSpeech-{100h, 960h} and TEDLIUM2
demonstrate that the proposed model improves over a standard CTC-based model
and other competitive models from prior work. We further analyze the results to
confirm the effectiveness of the intended representation learning with our
model.
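Below is a minimal, illustrative sketch of the training scheme described in the abstract: a stack of encoder layers with auxiliary CTC heads tapped at intermediate depths, where the target vocabulary grows from coarse to fine and each level's posterior is projected back into the encoder stream so that higher layers are explicitly conditioned on lower-level predictions. The layer indices, vocabulary sizes, and module names are assumptions for illustration, not the authors' implementation.
```python
# Hedged sketch of hierarchical conditional CTC (PyTorch). All hyperparameters
# (d_model, tap_layers, vocab_sizes) are illustrative assumptions.
import torch
import torch.nn as nn


class HierarchicalConditionalCTC(nn.Module):
    def __init__(self, d_model=256, n_layers=12,
                 vocab_sizes=(256, 2048, 16384), tap_layers=(4, 8, 12)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))
        # One CTC head per granularity; index 0 is the coarsest vocabulary.
        self.heads = nn.ModuleList(nn.Linear(d_model, v + 1) for v in vocab_sizes)
        # Project each intermediate posterior back to d_model so that upper
        # layers are conditioned on the sequence predicted at the lower level.
        self.condition = nn.ModuleList(
            nn.Linear(v + 1, d_model) for v in vocab_sizes[:-1])
        self.tap_layers = set(tap_layers)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, targets, target_lens):
        # feats: (B, T, d_model); targets/target_lens hold one entry per level,
        # ordered coarse-to-fine to match vocab_sizes.
        x, losses, level = feats, [], 0
        for depth, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if depth in self.tap_layers:
                logits = self.heads[level](x)                       # (B, T, V+1)
                log_probs = logits.log_softmax(-1).transpose(0, 1)  # (T, B, V+1)
                losses.append(self.ctc(log_probs, targets[level],
                                       feat_lens, target_lens[level]))
                if level < len(self.condition):
                    x = x + self.condition[level](logits.softmax(-1))
                level += 1
        return sum(losses) / len(losses)


# Toy forward/backward pass with random tensors, just to show the interface.
model = HierarchicalConditionalCTC()
feats = torch.randn(2, 100, 256)
feat_lens = torch.full((2,), 100, dtype=torch.long)
targets = [torch.randint(1, v + 1, (2, 20)) for v in (256, 2048, 16384)]
target_lens = [torch.full((2,), 20, dtype=torch.long) for _ in range(3)]
loss = model(feats, feat_lens, targets, target_lens)
loss.backward()
```
A full implementation would typically use a stronger acoustic encoder and interpolate the per-level losses with tunable weights; the sketch only illustrates the hierarchical conditioning mechanism.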
Related papers
- Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves the new state-of-the-art on all the structured prediction tasks we looked at.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- Word Sense Induction with Hierarchical Clustering and Mutual Information Maximization [14.997937028599255]
Word sense induction is a difficult problem in natural language processing.
We propose a novel unsupervised method based on hierarchical clustering and invariant information clustering.
We empirically demonstrate that, in certain cases, our approach outperforms prior WSI state-of-the-art methods.
arXiv Detail & Related papers (2022-10-11T13:04:06Z)
- Variable-rate hierarchical CPC leads to acoustic unit discovery in speech [11.641568891561866]
We explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding.
We propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module.
arXiv Detail & Related papers (2022-06-05T16:18:27Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of their powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Speech Summarization using Restricted Self-Attention [79.89680891246827]
We introduce a single model optimized end-to-end for speech summarization.
We demonstrate that the proposed model learns to directly summarize speech for the How-2 corpus of instructional videos.
arXiv Detail & Related papers (2021-10-12T18:21:23Z)
- Augmenting BERT-style Models with Predictive Coding to Improve Discourse-level Representations [20.855686009404703]
We propose to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn discourse-level representations.
Our proposed approach is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network.
arXiv Detail & Related papers (2021-09-10T00:45:28Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.