Bigger is not Always Better: The Effect of Context Size on Speech
Pre-Training
- URL: http://arxiv.org/abs/2312.01515v1
- Date: Sun, 3 Dec 2023 22:08:54 GMT
- Title: Bigger is not Always Better: The Effect of Context Size on Speech
Pre-Training
- Authors: Sean Robertson and Ewan Dunbar
- Abstract summary: We investigate how much context is necessary to achieve high-quality pre-trained acoustic models using self-supervised learning.
We find that phone discriminability in the resulting model representations peaks at around 40 ms of preceding context.
We find that this pattern also transfers to supervised ASR when the pre-trained representations are used as frozen input features.
- Score: 8.130638226288402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It has been generally assumed in the automatic speech recognition (ASR)
literature that it is better for models to have access to wider context
windows. Yet, many of the potential reasons this might be true in the
supervised setting do not necessarily transfer over to the case of unsupervised
learning. We investigate how much context is necessary to achieve high-quality
pre-trained acoustic models using self-supervised learning. We principally
investigate contrastive predictive coding (CPC), which we adapt to be able to
precisely control the amount of context visible to the model during training
and inference. We find that phone discriminability in the resulting model
representations peaks at around 40 ms of preceding context, and that having too
much context (beyond around 320 ms) substantially degrades the quality of the
representations. Surprisingly, we find that this pattern also transfers to
supervised ASR when the pre-trained representations are used as frozen input
features. Our results point to potential changes in the design of current
upstream architectures to better facilitate a variety of downstream tasks.
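The core manipulation in the abstract, limiting how much preceding context a frame-level representation may see, can be illustrated with a minimal sketch. This is a hypothetical, simplified stand-in for the paper's adapted CPC setup: a mean over a fixed causal window replaces the learned aggregator, and `info_nce` is a generic CPC-style contrastive loss, not the authors' exact implementation. Frame duration (e.g. 10 ms per frame, so `context_frames=4` corresponds to ~40 ms) is an assumption.

```python
import numpy as np


def limited_context_features(z: np.ndarray, context_frames: int) -> np.ndarray:
    """Aggregate each frame from at most `context_frames` preceding frames.

    z: (time, dim) latent frames, e.g. one per 10 ms. The representation at
    step t may only see frames t-context_frames+1 .. t; a window mean stands
    in for the learned causal aggregator (hypothetical simplification).
    """
    T, _ = z.shape
    out = np.zeros_like(z)
    for t in range(T):
        lo = max(0, t - context_frames + 1)  # left edge of the visible window
        out[t] = z[lo:t + 1].mean(axis=0)
    return out


def info_nce(c_t: np.ndarray, z_pos: np.ndarray, z_negs: np.ndarray) -> float:
    """CPC-style contrastive loss: score the true future frame vs. negatives."""
    scores = np.concatenate([[c_t @ z_pos], z_negs @ c_t])  # positive first
    scores = scores - scores.max()                           # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return float(-log_softmax[0])                            # positive is index 0
```

Varying `context_frames` while holding everything else fixed is the kind of controlled comparison the paper performs across context sizes.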
Related papers
- Context-Informed Grounding Supervision [102.11698329887226]
Context-Informed Grounding Supervision (CINGS) is a post-training supervision method in which the model is trained with relevant context prepended to the response.
Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains.
arXiv Detail & Related papers (2025-06-18T14:13:56Z)
- Contextures: The Mechanism of Representation Learning [3.7257289916860152]
The dissertation establishes the contexture theory to mathematically characterize the mechanism of representation learning, or pretraining.
The central argument is that a representation is learned from the association between the input X and a context variable A.
We demonstrate that many pretraining objectives can learn the contexture, including supervised learning, self-supervised learning, generative models, etc.
arXiv Detail & Related papers (2025-04-28T13:36:28Z)
- On the Loss of Context-awareness in General Instruction Fine-tuning [101.03941308894191]
We investigate the loss of context awareness after supervised fine-tuning.
We find that the performance decline is associated with a bias toward different roles learned during conversational instruction fine-tuning.
We propose a metric to identify context-dependent examples from general instruction fine-tuning datasets.
arXiv Detail & Related papers (2024-11-05T00:16:01Z)
- Revisiting Prompt Pretraining of Vision-Language Models [13.888505919946578]
We propose a general framework termed Revisiting Prompt Pretraining (RPP).
RPP targets improving fitting and generalization ability from two aspects: prompt structure and prompt supervision.
We additionally utilize soft labels derived from zero-shot probability predictions provided by a pretrained Contrastive Language Image Pretraining (CLIP) teacher model.
arXiv Detail & Related papers (2024-09-10T02:36:13Z)
- In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness [43.70647711168682]
We show the role of softmax attention in an ICL setting where each context encodes a regression task.
We show that an attention unit learns a window that it uses to implement a nearest-neighbors predictor adapted to the landscape of the pretraining tasks.
We also show that on low-rank, linear problems, the attention unit learns to project onto the appropriate subspace before inference.
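The claim that an attention unit learns a window implementing a nearest-neighbors predictor can be sketched with the classical Nadaraya-Watson view of softmax attention. This is a hypothetical illustration of the general idea, not the paper's construction: the inverse-temperature parameter `inv_temp` plays the role of the learned window width.

```python
import numpy as np


def softmax_attention_predict(x_ctx: np.ndarray, y_ctx: np.ndarray,
                              x_query: np.ndarray, inv_temp: float = 1.0) -> float:
    """Softmax attention over an in-context regression task.

    Scores are negative squared distances scaled by inv_temp. As inv_temp
    grows, the attention window narrows and the prediction approaches the
    label of the nearest context point; as it shrinks toward zero, the
    prediction averages over all context labels.
    """
    scores = -inv_temp * ((x_ctx - x_query) ** 2).sum(axis=1)  # (n_ctx,)
    weights = np.exp(scores - scores.max())                    # stable softmax
    weights /= weights.sum()
    return float(weights @ y_ctx)                              # weighted label average
```

With context points at 0, 1, 2 and a query near 1, a large `inv_temp` yields essentially the nearest neighbor's label, while `inv_temp = 0` yields the plain mean of the context labels.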
arXiv Detail & Related papers (2024-02-18T16:37:32Z)
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a novel search strategy based on greedy search to identify the near-optimal prompt for improving the performance of in-context learning.
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- How Well Do Sparse Imagenet Models Transfer? [75.98123173154605]
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" datasets.
In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset.
We show that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities.
arXiv Detail & Related papers (2021-11-26T11:58:51Z)
- Knowledge Distillation for Quality Estimation [79.51452598302934]
Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations.
Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results.
We show that this approach, in combination with data augmentation, leads to lightweight QE models that perform competitively with distilled pre-trained representations, with 8x fewer parameters.
arXiv Detail & Related papers (2021-07-01T12:36:21Z)
- Do Adversarially Robust ImageNet Models Transfer Better? [102.09335596483695]
Adversarially robust models often perform better than their standard-trained counterparts when used for transfer learning.
Our results are consistent with (and in fact, add to) recent hypotheses stating that robustness leads to improved feature representations.
arXiv Detail & Related papers (2020-07-16T17:42:40Z)
- TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [63.03318307254081]
TERA stands for Transformer Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representations extraction or fine-tuning with downstream models.
arXiv Detail & Related papers (2020-07-12T16:19:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.