Related papers: On the Utility of Self-supervised Models for Prosody-related Tasks

On the Utility of Self-supervised Models for Prosody-related Tasks

URL: http://arxiv.org/abs/2210.07185v1
Date: Thu, 13 Oct 2022 17:06:30 GMT
Title: On the Utility of Self-supervised Models for Prosody-related Tasks
Authors: Guan-Ting Lin, Chi-Luen Feng, Wei-Ping Huang, Yuan Tseng, Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Nigel G. Ward
Abstract summary: Self-Supervised Learning from speech data has produced models that have achieved remarkable performance in many tasks. We present a new evaluation framework, SUPERB-prosody, consisting of three prosody-related downstream tasks and two pseudo tasks. We find that 13 of the 15 SSL models outperformed the baseline on all the prosody-related tasks.
Score: 44.66341483900179
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-Supervised Learning (SSL) from speech data has produced models that have achieved remarkable performance in many tasks, and that are known to implicitly represent many aspects of information latently present in speech signals. However, relatively little is known about the suitability of such models for prosody-related tasks or the extent to which they encode prosodic information. We present a new evaluation framework, SUPERB-prosody, consisting of three prosody-related downstream tasks and two pseudo tasks. We find that 13 of the 15 SSL models outperformed the baseline on all the prosody-related tasks. We also show good performance on two pseudo tasks: prosody reconstruction and future prosody prediction. We further analyze the layerwise contributions of the SSL models. Overall we conclude that SSL speech models are highly effective for prosody-related tasks.

Related papers

Layer-wise Investigation of Large-Scale Self-Supervised Music Representation Models [4.243926243206826]
We analyze the advanced music representation model MusicFM and the newly emerged SSL model MuQ.<n>We focus on three main aspects: (i) validating the advantages of SSL models across multiple downstream tasks, (ii) exploring the specialization of layer-wise information for different tasks, and (iii) comparing performance differences when selecting specific layers.
arXiv Detail & Related papers (2025-05-22T06:58:24Z)
The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
arXiv Detail & Related papers (2025-01-15T10:57:55Z)
Enabling Natural Zero-Shot Prompting on Encoder Models via Statement-Tuning [55.265138447400744]
Statement-Tuning is a technique that models discriminative tasks as a set of finite statements and trains an encoder model to discriminate between the potential statements to determine the label. Experimental results demonstrate that Statement-Tuning achieves competitive performance compared to state-of-the-art LLMs with significantly fewer parameters. The study investigates the impact of several design choices on few-shot and zero-shot generalization, revealing that Statement-Tuning can achieve strong performance with modest training data.
arXiv Detail & Related papers (2024-04-19T14:05:03Z)
LLM Augmented LLMs: Expanding Capabilities through Composition [56.40953749310957]
CALM -- Composition to Augment Language Models -- introduces cross-attention between models to compose their representations and enable new capabilities. We illustrate that augmenting PaLM2-S with a smaller model trained on low-resource languages results in an absolute improvement of up to 13% on tasks like translation into English. When PaLM2-S is augmented with a code-specific model, we see a relative improvement of 40% over the base model for code generation and explanation tasks.
arXiv Detail & Related papers (2024-01-04T18:53:01Z)
Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies [1.2691047660244337]
Self-supervised learning models (SSL models) have been trained using large amounts of unlabeled data in the field of speech processing and music classification. We report the results of experiments comparing SSL models for three different tasks (i.e., singer identification, singing voice transcription, and singing technique classification) as initial exploration and aim to discuss these findings.
arXiv Detail & Related papers (2023-06-22T07:47:18Z)
Explaining, Analyzing, and Probing Representations of Self-Supervised Learning Models for Sensor-based Human Activity Recognition [2.2082422928825136]
Self-supervised learning (SSL) frameworks have been extensively applied to sensor-based Human Activity Recognition (HAR) In this paper, we aim to analyze deep representations of two recent SSL frameworks, namely SimCLR and VICReg.
arXiv Detail & Related papers (2023-04-14T07:53:59Z)
ProsAudit, a prosodic benchmark for self-supervised speech models [14.198508548718676]
ProsAudit is a benchmark to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset.
arXiv Detail & Related papers (2023-02-23T14:30:23Z)
SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community. There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
Investigation of Ensemble features of Self-Supervised Pretrained Models for Automatic Speech Recognition [0.3007949058551534]
Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks. This paper proposes using an ensemble of such SSL representations and models, which exploits the complementary nature of the features extracted by the various pretrained models.
arXiv Detail & Related papers (2022-06-11T12:43:00Z)
Sound and Visual Representation Learning with Multiple Pretraining Tasks [104.11800812671953]
Self-supervised tasks (SSL) reveal different features from the data. This work aims to combine Multiple SSL tasks (Multi-SSL) that generalizes well for all downstream tasks. Experiments on sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models.
arXiv Detail & Related papers (2022-01-04T09:09:38Z)
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines. In this paper, we propose a different explanation: pre-trains succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.