Investigating Pre-trained Audio Encoders in the Low-Resource Condition
- URL: http://arxiv.org/abs/2305.17733v1
- Date: Sun, 28 May 2023 14:15:19 GMT
- Title: Investigating Pre-trained Audio Encoders in the Low-Resource Condition
- Authors: Hao Yang, Jinming Zhao, Gholamreza Haffari, Ehsan Shareghi
- Abstract summary: We conduct a comprehensive set of experiments on three state-of-the-art encoders (Wav2vec2, WavLM, Whisper) in the low-resource setting.
We provide various quantitative and qualitative analyses on task performance, convergence speed, and representational properties of the encoders.
- Score: 66.92823764664206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained speech encoders have been central to pushing state-of-the-art
results across various speech understanding and generation tasks. Nonetheless,
the capabilities of these encoders in low-resource settings are yet to be
thoroughly explored. To address this, we conduct a comprehensive set of
experiments using a representative set of 3 state-of-the-art encoders
(Wav2vec2, WavLM, Whisper) in the low-resource setting across 7 speech
understanding and generation tasks. We provide various quantitative and
qualitative analyses on task performance, convergence speed, and
representational properties of the encoders. We observe a connection between
the pre-training protocols of these encoders and the way in which they capture
information in their internal layers. In particular, we observe that the Whisper
encoder exhibits the greatest low-resource capabilities on content-driven tasks,
both in performance and in convergence speed.
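For readers who want to reproduce this kind of layer-wise analysis, here is a minimal probing sketch using the HuggingFace transformers library. The checkpoint name and the norm statistic are illustrative choices, not the paper's exact setup; WavLM works identically, and Whisper's encoder can be probed analogously via its encoder submodule.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Any Wav2vec2/WavLM checkpoint works the same way; Whisper exposes its
# encoder via WhisperModel(...).encoder with log-mel input features.
name = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

wave = torch.randn(16000).numpy()  # one second of placeholder 16 kHz audio
inputs = extractor(wave, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds one (batch, frames, dim) tensor per layer --
# the raw material for layer-wise representational analyses.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i}: shape={tuple(h.shape)} mean-norm={h.norm(dim=-1).mean():.2f}")
```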
Related papers
- DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models [22.276574156358084]
We build a multi-exit encoder-decoder transformer model trained with deep supervision, so that each of its decoder layers can generate plausible predictions.
We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy than the baselines (a toy early-exit loop is sketched after this entry).
arXiv Detail & Related papers (2023-11-15T01:01:02Z)
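A toy illustration of the multi-exit idea: every decoder layer gets its own prediction head (deep supervision at train time), and inference stops at the first layer whose confidence clears a threshold. This is a hedged sketch; the heads and the exit criterion here are simplifications, not DEED's exact design.

```python
import torch
import torch.nn as nn

class EarlyExitDecoder(nn.Module):
    def __init__(self, dim=256, vocab=1000, num_layers=6, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers))
        # one prediction head per layer, trained with deep supervision
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(num_layers))
        self.threshold = threshold

    @torch.no_grad()
    def generate_step(self, tgt, memory):
        h = tgt
        for layer, head in zip(self.layers, self.heads):
            h = layer(h, memory)
            probs = head(h[:, -1]).softmax(-1)  # prediction at the last position
            if probs.max() >= self.threshold:   # confident enough: skip deeper layers
                break
        return probs.argmax(-1)

dec = EarlyExitDecoder()
tok = dec.generate_step(torch.randn(1, 5, 256), torch.randn(1, 20, 256))
```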
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework comprises a triple-view encoder and a dual-frequency decoder (the basic distillation signal is sketched after this entry).
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
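TriKD's triple-view and dual-frequency components are beyond a short snippet, but the underlying distillation signal can be sketched generically: a temperature-softened KL term between student and teacher segmentation logits. Purely illustrative, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL between temperature-softened class distributions, applied
    per pixel to segmentation logits of shape (B, C, H, W)."""
    log_s = F.log_softmax(student_logits / T, dim=1)
    soft_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_s, soft_t, reduction="batchmean") * T * T

loss = soft_distillation_loss(torch.randn(2, 21, 64, 64), torch.randn(2, 21, 64, 64))
```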
- Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network [99.03895740754402]
We propose a two-stream decoupled encoder-decoder design, in which the cross-modal encoder and decoder are decoupled into two streams.
We further propose a scheduled sampling strategy that mitigates the train-inference discrepancy by pretraining the encoder-decoder in a two-pass manner (a generic token-mixing sketch follows the entry).
arXiv Detail & Related papers (2021-01-27T17:36:57Z)
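The core scheduled-sampling move fits in a few lines: after a first decoding pass, gold target tokens are stochastically replaced with the model's own predictions before the second pass, so pretraining better matches inference-time conditions. A generic sketch; the paper's decoupled two-pass design adds more structure.

```python
import torch

def mix_for_second_pass(gold_ids, first_pass_ids, p):
    """Replace each gold token with the first-pass prediction with
    probability p (p is typically annealed upward over training)."""
    mask = torch.rand(gold_ids.shape, device=gold_ids.device) < p
    return torch.where(mask, first_pass_ids, gold_ids)

gold = torch.randint(0, 100, (2, 10))
pred = torch.randint(0, 100, (2, 10))
mixed = mix_for_second_pass(gold, pred, p=0.25)
```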
- On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance (illustrated in the sketch after this entry).
arXiv Detail & Related papers (2020-10-06T11:50:54Z)
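The finding above is easy to state in code: a decoder layer whose residual feed-forward block is optional. A simplified sketch (layer norms and dropout omitted), not the paper's full ablation setup.

```python
import torch
import torch.nn as nn

class SlimDecoderLayer(nn.Module):
    """Decoder layer with a droppable residual feed-forward block."""
    def __init__(self, dim=512, heads=8, use_ffn=True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        ) if use_ffn else None

    def forward(self, x, memory):
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        x = x + self.cross_attn(x, memory, memory, need_weights=False)[0]
        if self.ffn is not None:  # set use_ffn=False to drop this sub-layer
            x = x + self.ffn(x)
        return x

layer = SlimDecoderLayer(use_ffn=False)  # FFN-free variant
y = layer(torch.randn(1, 7, 512), torch.randn(1, 15, 512))
```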
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder that uses a set of shallow networks to capture more information content.
To further improve the architecture, we introduce a weight function that re-balances classes to increase the networks' attention to under-represented objects (a generic re-balancing loss is sketched after this entry).
arXiv Detail & Related papers (2020-07-19T18:44:34Z)
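One common way to realize such a re-balancing weight function is inverse-frequency class weighting in the loss. The snippet below is a hedged stand-in: the summary does not give the paper's exact weight function.

```python
import torch
import torch.nn.functional as F

def rebalanced_seg_loss(logits, target, eps=1.0):
    """Cross-entropy with inverse-frequency class weights computed from
    the current batch. logits: (B, C, H, W); target: (B, H, W)."""
    num_classes = logits.shape[1]
    counts = torch.bincount(target.flatten(), minlength=num_classes).float()
    weights = 1.0 / (counts + eps)                   # rare classes weigh more
    weights = weights * num_classes / weights.sum()  # keep mean weight near 1
    return F.cross_entropy(logits, target, weight=weights)

loss = rebalanced_seg_loss(torch.randn(2, 21, 32, 32), torch.randint(0, 21, (2, 32, 32)))
```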
- Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed using representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding: for each decoder layer, the representations from the last encoder layer, which serve as a global view, are supplemented with those from other encoder layers for a stereoscopic view of the source sequences (a toy reduction is sketched after this entry).
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
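A toy reduction of the multi-view idea: the decoder's cross-attention memory concatenates the last encoder layer (the global view) with one earlier layer, widening the "stereoscopic" source view. Illustrative only; the paper attends over many layers with more machinery.

```python
import torch
import torch.nn as nn

class MultiViewCrossAttention(nn.Module):
    """Cross-attention over the last encoder layer plus one extra layer."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, enc_last, enc_other):
        memory = torch.cat([enc_last, enc_other], dim=1)  # widen the source view
        return x + self.attn(x, memory, memory, need_weights=False)[0]

mv = MultiViewCrossAttention()
out = mv(torch.randn(1, 7, 512), torch.randn(1, 20, 512), torch.randn(1, 20, 512))
```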
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.