On the Evaluation of Speech Foundation Models for Spoken Language Understanding
- URL: http://arxiv.org/abs/2406.10083v1
- Date: Fri, 14 Jun 2024 14:37:52 GMT
- Title: On the Evaluation of Speech Foundation Models for Spoken Language Understanding
- Authors: Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe
- Abstract summary: The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking.
The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks.
We ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs?
- Score: 87.52911510306011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Inspired by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on much more speech recognition data (with labels), they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases the inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies.
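To make the three evaluation protocols concrete, here is a minimal PyTorch sketch. The `PretrainedSFM` stand-in, the 768-dimensional features, and the head sizes are illustrative assumptions, not the SLUE-PERB toolkit's actual API.

```python
# Minimal sketch of the three SLUE evaluation protocols.
# `PretrainedSFM`, the feature dimension, and the head sizes are
# illustrative assumptions, not the SLUE-PERB toolkit's API.
import torch
import torch.nn as nn

class PretrainedSFM(nn.Module):
    """Stand-in for a pre-trained speech foundation model encoder."""
    def __init__(self, dim=768):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv1d(1, dim, 10, stride=5), nn.GELU())

    def forward(self, wav):  # wav: (batch, samples)
        return self.encoder(wav.unsqueeze(1)).transpose(1, 2)  # (batch, frames, dim)

def build_model(protocol, num_classes=10, dim=768):
    sfm = PretrainedSFM(dim)
    if protocol in ("frozen_light", "frozen_complex"):
        for p in sfm.parameters():  # protocols (i) and (ii): freeze the SFM
            p.requires_grad = False
    if protocol == "frozen_complex":  # protocol (ii): complex prediction head
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        head = nn.Sequential(nn.TransformerEncoder(layer, num_layers=2),
                             nn.Linear(dim, num_classes))
    else:  # protocols (i) and (iii): lightweight prediction head
        head = nn.Linear(dim, num_classes)
    return sfm, head

sfm, head = build_model("frozen_light")
feats = sfm(torch.randn(2, 16000))  # features for 1 s of 16 kHz audio
logits = head(feats).mean(dim=1)    # pool over frames for a classification task
print(logits.shape)                 # torch.Size([2, 10])
```

Under protocols (i) and (ii), the optimizer would receive only the head's parameters; under protocol (iii), it would update both the SFM and the head.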
Related papers
- FMDLlama: Financial Misinformation Detection based on Large Language Models [35.487700542961136]
Large language models (LLMs) have demonstrated outstanding performance in various fields.
We propose FMDLlama, the first open-sourced instruction-following LLM for the FMD task, based on fine-tuning Llama3.1 with instruction data.
We compare our models with a variety of LLMs on FMD-B, where our model outperforms all other open-sourced LLMs as well as ChatGPT.
arXiv Detail & Related papers (2024-09-24T20:44:30Z)
- Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models [12.034908403186202]
We investigate whether music foundation models (MFMs) or speech foundation models (SFMs) work better for singing voice deepfake detection (SVDD).
Speaker recognition SFM representations perform best among all the foundation models (FMs).
arXiv Detail & Related papers (2024-09-21T12:50:53Z)
- FedPFT: Federated Proxy Fine-Tuning of Foundation Models [55.58899993272904]
Adapting Foundation Models (FMs) for downstream tasks through Federated Learning (FL) emerges as a promising strategy for protecting data privacy and valuable FMs.
Existing methods fine-tune the FM by allocating sub-FMs to clients in FL, leading to suboptimal performance due to insufficient tuning and inevitable error accumulation in gradients.
We propose Federated Proxy Fine-Tuning (FedPFT), a novel method that enhances FM adaptation to downstream tasks through FL via two key modules.
arXiv Detail & Related papers (2024-04-17T16:30:06Z)
- Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning [23.763256908202496]
Foundation models (FMs) have revolutionized the field of AI by showing remarkable performance in various tasks.
However, FMs exhibit numerous limitations that prevent their broader adoption in many real-world systems.
We propose a conceptual framework that encapsulates different modes by which agents could interact with FMs.
arXiv Detail & Related papers (2024-02-02T18:00:35Z)
- Speech foundation models on intelligibility prediction for hearing-impaired listeners [4.742307809368852]
Speech foundation models (SFMs) have been benchmarked on many speech processing tasks.
We present a systematic evaluation of 10 SFMs on one such application: speech intelligibility prediction.
We propose a simple method that learns a specialized prediction head on top of frozen SFMs to approach the problem.
arXiv Detail & Related papers (2024-01-24T18:26:52Z)
- FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses the challenges of federated fine-tuning of LLMs and introduces our package FS-LLM as the main contribution.
We provide comprehensive implementations of federated parameter-efficient fine-tuning algorithms and versatile programming interfaces for future extension in FL scenarios (see the sketch after this list).
We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
- VideoGLUE: Video General Understanding Evaluation of Foundation Models [89.07145427268948]
We evaluate video understanding capabilities of foundation models (FMs) using a carefully designed experiment protocol.
We jointly profile FMs' efficacy and efficiency when adapting them to general video understanding tasks.
arXiv Detail & Related papers (2023-07-06T17:47:52Z)
- SUPERB: Speech processing Universal PERformance Benchmark [78.41287216481203]
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV).
SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks.
We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model.
arXiv Detail & Related papers (2021-05-03T17:51:09Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
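Two of the entries above (FedPFT and FS-LLM) revolve around federated fine-tuning in which clients train and communicate only a small trainable module rather than the full FM. The following is a generic sketch of that idea, federated averaging over an adapter layer on top of a frozen backbone; all names are illustrative assumptions and do not reflect either package's API.

```python
# Generic sketch of federated averaging over a parameter-efficient
# adapter; names are illustrative, not the FS-LLM or FedPFT APIs.
import copy
import torch
import torch.nn as nn

class AdapterClassifier(nn.Module):
    def __init__(self, dim=64, num_classes=4):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)          # stands in for a frozen FM
        self.backbone.requires_grad_(False)
        self.adapter = nn.Linear(dim, num_classes)   # the only trainable part

    def forward(self, x):
        return self.adapter(torch.relu(self.backbone(x)))

def local_update(model, data, steps=5, lr=1e-2):
    """One client's local training; only adapter parameters get gradients."""
    opt = torch.optim.SGD(model.adapter.parameters(), lr=lr)
    x, y = data
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return {k: v.detach().clone() for k, v in model.adapter.state_dict().items()}

def fedavg(updates):
    """The server averages only the small adapter state dicts, never the full FM."""
    return {k: torch.stack([u[k] for u in updates]).mean(0) for k in updates[0]}

global_model = AdapterClassifier()
clients = [(torch.randn(8, 64), torch.randint(0, 4, (8,))) for _ in range(3)]
for _ in range(2):  # two communication rounds
    updates = [local_update(copy.deepcopy(global_model), d) for d in clients]
    global_model.adapter.load_state_dict(fedavg(updates))
```

Communicating only the adapter keeps per-round traffic proportional to the head size rather than the FM size, which is the main appeal of parameter-efficient fine-tuning in FL.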
This list is automatically generated from the titles and abstracts of the papers on this site.