Multi-View Attention Transfer for Efficient Speech Enhancement
- URL: http://arxiv.org/abs/2208.10367v1
- Date: Mon, 22 Aug 2022 14:47:47 GMT
- Title: Multi-View Attention Transfer for Efficient Speech Enhancement
- Authors: Wooseok Shin, Hyun Joon Park, Jin Sob Kim, Byung Hoon Lee, Sung Won Han
- Abstract summary: We propose multi-view attention transfer (MV-AT), a feature-based distillation, to obtain efficient speech enhancement models in the time domain.
Based on the multi-view feature extraction model, MV-AT transfers multi-view knowledge of the teacher network to the student network without additional parameters.
- Score: 1.6932706284468382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent deep learning models have achieved high performance in speech
enhancement; however, it is still challenging to obtain a fast and
low-complexity model without significant performance degradation. Previous
knowledge distillation studies on speech enhancement could not solve this
problem because their output distillation methods do not fit the speech
enhancement task in some aspects. In this study, we propose multi-view
attention transfer (MV-AT), a feature-based distillation, to obtain efficient
speech enhancement models in the time domain. Based on the multi-view feature
extraction model, MV-AT transfers multi-view knowledge of the teacher network
to the student network without additional parameters. The experimental results
show that the proposed method consistently improved the performance of student
models of various sizes on the Valentini and deep noise suppression (DNS)
datasets. MANNER-S-8.1GF with our proposed method, a lightweight model for
efficient deployment, requires 15.4x fewer parameters and 4.71x fewer
floating-point operations (FLOPs) than the baseline model while achieving
similar performance.
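The abstract describes MV-AT only at a high level. As a rough illustration of feature-based attention transfer between a teacher and a student, the sketch below derives two attention "views" (over time and over channels) from 1-D speech feature maps and matches them with an MSE loss; the shapes, normalization, and loss form are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of feature-based attention transfer for time-domain
# speech features. The "multi-view" idea is illustrated with two views derived
# from the same (batch, channels, time) feature map.
import torch
import torch.nn.functional as F


def attention_map(feat: torch.Tensor, dim: int) -> torch.Tensor:
    """Collapse a (B, C, T) feature map into an attention map by summing
    squared activations over `dim`, then L2-normalizing per example."""
    att = feat.pow(2).sum(dim=dim)   # (B, T) if dim=1, (B, C) if dim=2
    return F.normalize(att, dim=1)   # unit L2 norm per example


def multi_view_at_loss(teacher_feats, student_feats):
    """Match teacher and student attention maps from two views (time, channel).
    No extra trainable parameters are introduced; mismatched channel or time
    sizes would need pooling/interpolation in a real system."""
    loss = torch.zeros(())
    for ft, fs in zip(teacher_feats, student_feats):
        # View 1: sum over channels -> attention across time steps.
        loss = loss + F.mse_loss(attention_map(fs, dim=1), attention_map(ft, dim=1))
        # View 2: sum over time -> attention across channels.
        loss = loss + F.mse_loss(attention_map(fs, dim=2), attention_map(ft, dim=2))
    return loss / max(len(teacher_feats), 1)


if __name__ == "__main__":
    B, C, T = 4, 64, 8000
    teacher_feats = [torch.randn(B, C, T) for _ in range(3)]  # frozen teacher layers
    student_feats = [torch.randn(B, C, T) for _ in range(3)]  # trainable student layers
    kd = multi_view_at_loss(teacher_feats, student_feats)
    print(kd.item())  # combined with the usual enhancement loss during training
```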
Related papers
- EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models [29.57891007810509]
Large Language Models (LLMs) have demonstrated outstanding performance across a variety of natural language processing tasks.
We introduce EchoAtt, a novel framework aimed at optimizing transformer-based models by analyzing and leveraging the similarity of attention patterns across layers.
Our best results with TinyLLaMA-1.1B demonstrate that EchoAtt improves inference speed by 15%, training speed by 25%, and reduces the number of parameters by approximately 4%, all while improving zero-shot performance.
arXiv Detail & Related papers (2024-09-22T21:08:37Z) - Pre-training Feature Guided Diffusion Model for Speech Enhancement [37.88469730135598]
Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments.
We introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement.
arXiv Detail & Related papers (2024-06-11T18:22:59Z) - DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.
DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z) - UniFL: Improve Stable Diffusion via Unified Feedback Learning [51.18278664629821]
We present UniFL, a unified framework that leverages feedback learning to enhance diffusion models comprehensively.
UniFL incorporates three key components: perceptual feedback learning, which enhances visual quality; decoupled feedback learning, which improves aesthetic appeal; and adversarial feedback learning, which optimizes inference speed.
In-depth experiments and extensive user studies validate the superior performance of the proposed method in enhancing both generation quality and inference speed.
arXiv Detail & Related papers (2024-04-08T15:14:20Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method, dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z) - Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks (a minimal sketch of the ensemble distillation idea follows this list).
arXiv Detail & Related papers (2023-02-24T17:15:39Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Online Knowledge Distillation via Multi-branch Diversity Enhancement [15.523646047674717]
We propose a new distillation method to enhance the diversity among multiple student models.
We use a Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network.
We also use a Classifier Diversification (CD) loss function to strengthen the differences between the student models.
arXiv Detail & Related papers (2020-10-02T05:52:12Z)