On Compressing Sequences for Self-Supervised Speech Models
- URL: http://arxiv.org/abs/2210.07189v2
- Date: Fri, 14 Oct 2022 15:21:22 GMT
- Title: On Compressing Sequences for Self-Supervised Speech Models
- Authors: Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia,
Hung-yi Lee, Hao Tang
- Abstract summary: We study fixed-length and variable-length subsampling along the time axis in self-supervised learning.
We find that variable-length subsampling performs particularly well under low frame rates.
If we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.
- Score: 78.62210521316081
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compressing self-supervised models has become increasingly necessary, as
self-supervised models become larger. While previous approaches have primarily
focused on compressing the model size, shortening sequences is also effective
in reducing the computational cost. In this work, we study fixed-length and
variable-length subsampling along the time axis in self-supervised learning. We
explore how individual downstream tasks are sensitive to input frame rates.
Subsampling while training self-supervised models not only improves the overall
performance on downstream tasks under certain frame rates, but also brings
significant speed-up in inference. Variable-length subsampling performs
particularly well under low frame rates. In addition, if we have access to
phonetic boundaries, we find no degradation in performance for an average frame
rate as low as 10 Hz.
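
The two subsampling schemes the abstract contrasts can be sketched as follows. This is an illustrative sketch, not the authors' code: fixed-length subsampling average-pools frames with a constant stride (e.g. stride 5 takes a 50 Hz feature sequence down to the 10 Hz average rate mentioned above), while variable-length subsampling averages frames within segments, e.g. delimited by phonetic boundaries. The function names and the use of mean pooling are assumptions for illustration.

```python
import numpy as np

def fixed_subsample(frames, stride):
    """Fixed-length subsampling: average-pool consecutive frames
    with a constant stride along the time axis."""
    T, D = frames.shape
    n = T // stride                      # number of output frames
    return frames[: n * stride].reshape(n, stride, D).mean(axis=1)

def variable_subsample(frames, boundaries):
    """Variable-length subsampling: average frames within each segment,
    where `boundaries` are frame indices (e.g. phonetic boundaries)."""
    segments = np.split(frames, boundaries)
    return np.stack([seg.mean(axis=0) for seg in segments if len(seg)])

frames = np.random.randn(100, 8)             # 100 frames of 8-dim features
pooled = fixed_subsample(frames, stride=5)   # 100 -> 20 frames (5x reduction)
segs = variable_subsample(frames, [12, 30, 55, 80])  # 5 segments -> 5 frames
```

Both variants shorten the sequence the upper layers must process, which is where the inference speed-up comes from; the boundary-aware variant keeps one vector per (roughly phone-sized) segment rather than per fixed window.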
Related papers
- Diffusion Auto-regressive Transformer for Effective Self-supervised Time Series Forecasting [47.58016750718323]
We propose a novel generative self-supervised method called TimeDART.
TimeDART captures both the global sequence dependence and local detail features within time series data.
Our code is publicly available at https://github.com/Melmaphother/TimeDART.
arXiv Detail & Related papers (2024-10-08T06:08:33Z)
- DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models [55.608981341747246]
We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss.
Our analysis of the adaptivity of DAISY shows that the model exits early (using fewer layers) on clean data while exiting late (using more layers) on noisy data.
arXiv Detail & Related papers (2024-06-08T12:58:13Z)
- HumMUSS: Human Motion Understanding using State Space Models [6.821961232645209]
We propose a novel attention-free model for human motion understanding, building upon recent advancements in state space models.
Our model supports both offline and real-time applications.
For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches.
arXiv Detail & Related papers (2024-04-16T19:59:21Z)
- Efficient Video Prediction via Sparsely Conditioned Flow Matching [24.32740918613266]
We introduce a novel generative model for video prediction based on latent flow matching.
We call our model Random frame conditioned flow Integration for VidEo pRediction, or, in short, RIVER.
arXiv Detail & Related papers (2022-11-26T14:18:50Z)
- Once-for-All Sequence Compression for Self-Supervised Speech Models [62.60723685118747]
We introduce a once-for-all sequence compression framework for self-supervised speech models.
The framework is evaluated on various tasks, showing marginal degradation compared to fixed compression-rate variants.
We also explore adaptive compression-rate learning, demonstrating the ability to select task-specific preferred frame periods without needing a grid search.
arXiv Detail & Related papers (2022-11-04T09:19:13Z)
- Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform semantic video segmentation efficiently in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, we design new knowledge distillation methods.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.