On Compressing Sequences for Self-Supervised Speech Models
- URL: http://arxiv.org/abs/2210.07189v2
- Date: Fri, 14 Oct 2022 15:21:22 GMT
- Title: On Compressing Sequences for Self-Supervised Speech Models
- Authors: Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia,
Hung-yi Lee, Hao Tang
- Abstract summary: We study fixed-length and variable-length subsampling along the time axis in self-supervised learning.
We find that variable-length subsampling performs particularly well under low frame rates.
If we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.
- Score: 78.62210521316081
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compressing self-supervised models has become increasingly necessary, as
self-supervised models become larger. While previous approaches have primarily
focused on compressing the model size, shortening sequences is also effective
in reducing the computational cost. In this work, we study fixed-length and
variable-length subsampling along the time axis in self-supervised learning. We
explore how individual downstream tasks are sensitive to input frame rates.
Subsampling while training self-supervised models not only improves the overall
performance on downstream tasks under certain frame rates, but also brings
significant speed-up in inference. Variable-length subsampling performs
particularly well under low frame rates. In addition, if we have access to
phonetic boundaries, we find no degradation in performance for an average frame
rate as low as 10 Hz.
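
The two subsampling schemes the abstract contrasts can be sketched as follows. This is an illustrative sketch, not the authors' code: fixed-length subsampling average-pools frames with a constant stride (e.g. stride 5 takes a 50 Hz feature sequence down to the 10 Hz average rate mentioned above), while variable-length subsampling averages frames within segments, e.g. delimited by phonetic boundaries. The function names and the use of mean pooling are assumptions for illustration.

```python
import numpy as np

def fixed_subsample(frames, stride):
    """Fixed-length subsampling: average-pool consecutive frames
    with a constant stride along the time axis."""
    T, D = frames.shape
    n = T // stride                      # number of output frames
    return frames[: n * stride].reshape(n, stride, D).mean(axis=1)

def variable_subsample(frames, boundaries):
    """Variable-length subsampling: average frames within each segment,
    where `boundaries` are frame indices (e.g. phonetic boundaries)."""
    segments = np.split(frames, boundaries)
    return np.stack([seg.mean(axis=0) for seg in segments if len(seg)])

frames = np.random.randn(100, 8)             # 100 frames of 8-dim features
pooled = fixed_subsample(frames, stride=5)   # 100 -> 20 frames (5x reduction)
segs = variable_subsample(frames, [12, 30, 55, 80])  # 5 segments -> 5 frames
```

Both variants shorten the sequence the upper layers must process, which is where the inference speed-up comes from; the boundary-aware variant keeps one vector per (roughly phone-sized) segment rather than per fixed window.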
Related papers
- Diffusion Auto-regressive Transformer for Effective Self-supervised Time Series Forecasting [47.58016750718323]
We propose a novel generative self-supervised method called TimeDART.
TimeDART captures both the global sequence dependence and local detail features within time series data.
Our code is publicly available at https://github.com/Melmaphother/TimeDART.
arXiv Detail & Related papers (2024-10-08T06:08:33Z)
- DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models [55.608981341747246]
We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss.
Our analysis of the adaptivity of DAISY shows that the model exits early (using fewer layers) on clean data while exiting late (using more layers) on noisy data.
arXiv Detail & Related papers (2024-06-08T12:58:13Z)
- HumMUSS: Human Motion Understanding using State Space Models [6.821961232645209]
We propose a novel attention-free model for human motion understanding, building upon recent advancements in state space models.
Our model supports both offline and real-time applications.
For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches.
arXiv Detail & Related papers (2024-04-16T19:59:21Z)
- Efficient Video Prediction via Sparsely Conditioned Flow Matching [24.32740918613266]
We introduce a novel generative model for video prediction based on latent flow matching.
We call our model Random frame conditioned flow Integration for VidEo pRediction, or, in short, RIVER.
arXiv Detail & Related papers (2022-11-26T14:18:50Z)
- Once-for-All Sequence Compression for Self-Supervised Speech Models [62.60723685118747]
We introduce a once-for-all sequence compression framework for self-supervised speech models.
The framework is evaluated on various tasks, showing marginal degradation compared to fixed compression-rate variants.
We also explore adaptive compression-rate learning, demonstrating the ability to select task-specific preferred frame periods without needing a grid search.
arXiv Detail & Related papers (2022-11-04T09:19:13Z)
- Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform semantic video segmentation efficiently in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, we design new knowledge distillation methods.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.