How Redundant Is the Transformer Stack in Speech Representation Models?
- URL: http://arxiv.org/abs/2409.16302v1
- Date: Tue, 10 Sep 2024 11:00:24 GMT
- Title: How Redundant Is the Transformer Stack in Speech Representation Models?
- Authors: Teresa Dorszewski, Albert Kjøller Jacobsen, Lenka Tětková, Lars Kai Hansen
- Abstract summary: Self-supervised speech representation models have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection.
Recent studies on transformer models revealed a high redundancy between layers and the potential for significant pruning.
We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training.
- Score: 1.3873323883842132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised speech representation models, particularly those leveraging
transformer architectures, have demonstrated remarkable performance across
various tasks such as speech recognition, speaker identification, and emotion
detection. Recent studies on transformer models revealed a high redundancy
between layers and the potential for significant pruning, which we will
investigate here for transformer-based speech representation models. We perform
a detailed analysis of layer similarity in speech representation models using
three similarity metrics: cosine similarity, centered kernel alignment, and
mutual nearest-neighbor alignment. Our findings reveal a block-like structure
of high similarity, suggesting two main processing steps and significant
redundancy of layers. We demonstrate the effectiveness of pruning
transformer-based speech representation models without the need for
post-training, achieving up to 40% reduction in transformer layers while
maintaining over 95% of the model's predictive capacity. Furthermore, we employ
a knowledge distillation method to substitute the entire transformer stack with
mimicking layers, reducing the network size 95-98% and the inference time by up
to 94%. This substantial decrease in computational load occurs without
considerable performance loss, suggesting that the transformer stack is almost
completely redundant for downstream applications of speech representation
models.
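As a concrete illustration of the analysis above, the sketch below computes linear centered kernel alignment (one of the three similarity metrics mentioned) between the transformer layers of a speech representation model and then drops a contiguous block of layers without any post-training. It is a minimal sketch, assuming a wav2vec 2.0-style encoder from the Hugging Face transformers library, where the transformer blocks live in model.encoder.layers; the checkpoint name, the random input, and the pruned layer indices are illustrative assumptions, not the authors' exact setup.

```python
import torch
from transformers import Wav2Vec2Model


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two (n_frames, dim) activation matrices."""
    x = x - x.mean(dim=0, keepdim=True)  # center features over frames
    y = y - y.mean(dim=0, keepdim=True)
    num = torch.linalg.norm(y.T @ x) ** 2                        # ||Y^T X||_F^2
    den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return (num / den).item()


# Illustrative checkpoint; any wav2vec 2.0 / HuBERT-style encoder would do.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
waveform = torch.randn(1, 16000)  # stand-in for one second of 16 kHz audio

with torch.no_grad():
    hidden = model(waveform, output_hidden_states=True).hidden_states

# Per-layer activations; drop the first entry (pre-transformer features).
acts = [h.squeeze(0) for h in hidden[1:]]
n = len(acts)
sim = torch.tensor([[linear_cka(acts[i], acts[j]) for j in range(n)]
                    for i in range(n)])
print(sim)  # a block-like structure of high similarity flags redundant layers

# Drop a contiguous block of highly similar layers, with no post-training.
# The indices below are illustrative only; the paper reports up to ~40% of
# transformer layers removable while keeping >95% of predictive capacity.
drop = set(range(4, 9))
model.encoder.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(model.encoder.layers) if i not in drop
)
```

In practice, the block of layers to drop would be chosen from the similarity matrix rather than fixed in advance; the distillation variant described in the abstract goes further and replaces the entire transformer stack with small mimicking layers.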
Related papers
- Differential Transformer [99.5117269150629]
The Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Convexity-based Pruning of Speech Representation Models [1.3873323883842132]
Recent work has shown that there is significant redundancy in transformer models for NLP.
In this paper, we investigate layer pruning in audio models.
We find a massive reduction in computational effort with no loss of performance, and even improvements in certain cases.
arXiv Detail & Related papers (2024-08-16T09:04:54Z)
- Transformers For Recognition In Overhead Imagery: A Reality Check [0.0]
We compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery.
Our results suggest that transformers provide consistent, but modest, performance improvements.
arXiv Detail & Related papers (2022-10-23T02:17:31Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces the number of parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z)
- TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representation extraction or fine-tuning with downstream models.
arXiv Detail & Related papers (2020-07-12T16:19:00Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
- Multi-scale Transformer Language Models [30.201934597815583]
We investigate multi-scale transformer language models that learn representations of text at multiple scales.
We present three different architectures that have an inductive bias to handle the hierarchical nature of language.
arXiv Detail & Related papers (2020-05-01T19:58:56Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
- Hierarchical Transformer Network for Utterance-level Emotion Recognition [0.0]
We address some challenges in utterance-level emotion recognition (ULER).
Unlike the traditional text classification problem, this task is supported by a limited number of datasets.
We use a pretrained language model, bidirectional encoder representations from transformers (BERT), as the lower-level transformer.
In addition, we add speaker embeddings to the model for the first time, which enables our model to capture the interaction between speakers.
arXiv Detail & Related papers (2020-02-18T13:44:49Z)