Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation
- URL: http://arxiv.org/abs/2407.09886v2
- Date: Mon, 23 Sep 2024 16:45:04 GMT
- Title: Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation
- Authors: Chun-Yi Kuan, Chih-Kai Yang, Wei-Ping Huang, Ke-Han Lu, Hung-yi Lee
- Abstract summary: Speech-Copilot is a modular framework for instruction-oriented speech-processing tasks.
It builds speech processing-specific toolsets by analyzing pre-collected task instructions.
It features a flexible agent based on large language models that performs tasks through program generation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on large language models that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing agent based on large language models, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without additional training processes required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.
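To make the decompose-then-generate pattern concrete, the sketch below shows one plausible shape for such an agent: the LLM receives the tool documentation and an instruction, emits a small program, and the program is executed against the toolset. The tool names, prompt wording, and the `llm` callable are hypothetical placeholders, not the paper's actual toolset or prompts.

```python
# Minimal sketch of the decompose-then-generate agent pattern. All tool
# names, the prompt text, and the `llm` callable are hypothetical
# placeholders, not Speech-Copilot's actual toolset or prompts.

TOOL_DOCS = """
transcribe(audio_path: str) -> str         # speech-to-text
classify_emotion(audio_path: str) -> str   # e.g. happy / sad / angry / neutral
detect_language(audio_path: str) -> str    # ISO language code
"""

AGENT_PROMPT = """You are a speech-processing agent.
Available tools:
{tools}
Break the instruction into sub-tasks, then write a Python program that
solves it by calling the tools and storing the answer in `result`.
Return only code.

Instruction: {instruction}
"""

def run_agent(instruction: str, llm, tool_impls: dict):
    """Ask the LLM for a program, then execute it against the toolset."""
    program = llm(AGENT_PROMPT.format(tools=TOOL_DOCS, instruction=instruction))
    namespace = dict(tool_impls)   # expose tool implementations to the program
    exec(program, namespace)       # run the generated program
    return namespace.get("result")
```

Because the agent only emits glue code over a fixed toolset, extending it to a new task is a matter of adding a module and a line of documentation, which is where the "no additional training" flexibility comes from.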
Related papers
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that prompting achieves performance competitive with strong fine-tuning baselines.
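A rough PyTorch sketch of the prompting idea, assuming a generic frozen LM over discrete speech units and its embedding table; the interfaces are illustrative, not SpeechPrompt's code:

```python
import torch
import torch.nn as nn

class PromptedUnitLM(nn.Module):
    """Prompt tuning over a frozen speech unit LM: only the prepended
    prompt embeddings are trained. Interfaces are illustrative, not
    SpeechPrompt's actual code."""
    def __init__(self, unit_lm: nn.Module, unit_embed: nn.Embedding,
                 n_prompt: int = 16):
        super().__init__()
        self.unit_lm, self.unit_embed = unit_lm, unit_embed
        for p in self.unit_lm.parameters():
            p.requires_grad = False   # keep the pretrained LM frozen
        dim = unit_embed.embedding_dim
        self.prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)

    def forward(self, unit_ids: torch.Tensor) -> torch.Tensor:
        x = self.unit_embed(unit_ids)                      # (B, T, D)
        p = self.prompt.expand(x.size(0), -1, -1)          # (B, P, D)
        return self.unit_lm(torch.cat([p, x], dim=1))      # logits over units
```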
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks [3.015760169663536]
We investigate the potential of adapter-based fine-tuning in developing a unified model capable of handling multiple spoken language processing tasks.
We show that adapter-based fine-tuning enables a single encoder-decoder model to perform multiple speech processing tasks with an average improvement of 18.4%.
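As a rough illustration of adapter-based fine-tuning in general (not this paper's exact architecture), the following PyTorch sketch shows the standard bottleneck adapter: a small residual module inserted into a frozen backbone, with only the adapter parameters trained per task.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter: down-project, nonlinearity, up-project,
    residual connection. The backbone it is inserted into stays frozen;
    dimensions are illustrative."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual keeps the pretrained behavior when the adapter is near zero.
        return x + self.up(self.act(self.down(x)))
```

One such module per task can be slotted into each encoder-decoder layer, which is how a single frozen backbone serves multiple spoken language processing tasks with few extra parameters.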
arXiv Detail & Related papers (2024-06-20T21:39:04Z)
- SpeechVerse: A Large-scale Generalizable Audio Language Model [38.67969337605572]
SpeechVerse is a robust multi-task training and curriculum learning framework.
It combines pre-trained speech and text foundation models via a small set of learnable parameters.
Our empirical experiments reveal that the multi-task SpeechVerse model outperforms conventional task-specific baselines on 9 of the 11 tasks.
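One common way to realize "a small set of learnable parameters" bridging a frozen speech encoder and a frozen text LM is a convolutional downsampler plus a projection; the sketch below is a generic version of that idea with illustrative dimensions, not SpeechVerse's actual modules.

```python
import torch
import torch.nn as nn

class SpeechToLLMConnector(nn.Module):
    """Small trainable bridge from a frozen speech encoder to a frozen text
    LM: 1-D convolutions downsample the features in time, and a linear layer
    projects them into the LM embedding space. Dimensions are illustrative."""
    def __init__(self, speech_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.downsample = nn.Sequential(
            nn.Conv1d(speech_dim, speech_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(speech_dim, speech_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, speech_dim) -> (B, T/4, llm_dim) pseudo-token embeddings
        x = self.downsample(feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)
```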
arXiv Detail & Related papers (2024-05-14T03:33:31Z)
- WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders and a prompt-aware LoRA weight adapter.
We validate the proposed model on universal speech benchmarks covering tasks such as automatic speech recognition (ASR), speech translation (ST), speaker verification (SV), and emotion recognition (ER), and also apply it to specialized datasets such as a Gaokao English listening comprehension set for spoken question answering (SQA) and a speech Chain-of-Thought (CoT) evaluation set.
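For reference, here is the standard LoRA mechanism on which WavLLM's adapter builds; the prompt-aware weighting itself is WavLLM-specific and not reproduced here, so treat this as a generic sketch.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (generic LoRA).
    WavLLM additionally makes the adapter prompt-aware; only the standard
    low-rank mechanism is shown here."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # start at no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + scale * B A x, with only A and B trained.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```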
arXiv Detail & Related papers (2024-03-31T12:01:32Z)
- SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition [67.08798754009153]
Speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.
We propose a novel decoder-only speech language model, SpeechComposer, that can unify common speech tasks by composing a fixed set of prompt tokens.
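A toy sketch of composing a fixed prompt-token vocabulary into task specifications is given below; the token names and task set are invented for illustration and are not SpeechComposer's actual tokens.

```python
# Hypothetical prompt-token vocabulary; not SpeechComposer's actual tokens.
TASKS = {
    "asr":     ["<speech>", "<text>"],               # speech in, text out
    "tts":     ["<text>", "<speech>"],               # text in, speech out
    "resynth": ["<speech>", "<units>", "<speech>"],  # unit-based resynthesis
}

def compose(*task_names: str) -> list[str]:
    """Chain primitive task prompts, merging the shared boundary token so
    composed tasks reuse the same fixed vocabulary."""
    seq: list[str] = []
    for name in task_names:
        for tok in TASKS[name]:
            if not seq or seq[-1] != tok:   # merge the shared boundary token
                seq.append(tok)
    return seq

print(compose("asr", "tts"))   # ['<speech>', '<text>', '<speech>']
```

The appeal of the composition view is that new task combinations need no new prompt tokens, only new orderings of the fixed set.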
arXiv Detail & Related papers (2024-01-31T18:06:29Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- Context-Aware Language Modeling for Goal-Oriented Dialogue Systems [84.65707332816353]
We formulate goal-oriented dialogue as a partially observed Markov decision process.
We derive a simple and effective method to finetune language models in a goal-aware way.
We evaluate our method on a practical flight-booking task using AirDialogue.
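As one generic way to make finetuning goal-aware (a simplification for illustration, not the paper's derivation), training dialogues can be tagged with their outcome so that decoding can later be conditioned on success:

```python
def goal_aware_loss(nll, dialogues):
    """Generic illustration of goal-conditioned finetuning, not the paper's
    exact method: prepend an outcome token to each dialogue so the LM learns
    outcome-conditioned behavior, then decode conditioned on <success>.
    `nll` is a hypothetical callable mapping a string to a scalar LM loss."""
    total = 0.0
    for d in dialogues:   # d: {"text": str, "success": bool}
        tag = "<success> " if d["success"] else "<failure> "
        total += nll(tag + d["text"])
    return total / len(dialogues)
```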
arXiv Detail & Related papers (2022-04-18T17:23:11Z)
- Re-framing Incremental Deep Language Models for Dialogue Processing with Multi-task Learning [14.239355474794142]
We present a multi-task learning framework to enable the training of one universal incremental dialogue processing model.
We show that these tasks provide positive inductive biases to each other, with the optimal contribution of each task depending on the severity of the noise it introduces.
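A generic sketch of this multi-task setup, with a shared incremental encoder and weighted per-task losses (the task names and label sizes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class MultiTaskDialogueModel(nn.Module):
    """Shared-encoder, per-task-head pattern; task names and label sizes
    below are illustrative, not the paper's exact setup."""
    def __init__(self, vocab: int = 5000, dim: int = 256, tasks=None):
        super().__init__()
        tasks = tasks or {"disfluency": 2, "segmentation": 2, "pos_tag": 17}
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)  # word-by-word, so incremental
        self.heads = nn.ModuleDict({t: nn.Linear(dim, n) for t, n in tasks.items()})

    def forward(self, tokens: torch.Tensor) -> dict:
        h, _ = self.encoder(self.embed(tokens))             # shared representation
        return {t: head(h) for t, head in self.heads.items()}

def joint_loss(outputs: dict, labels: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of per-task losses; the weights are where the
    noise-dependent contribution mentioned above would be tuned."""
    ce = nn.CrossEntropyLoss()
    return sum(weights[t] * ce(outputs[t].flatten(0, 1), labels[t].flatten())
               for t in outputs)
```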
arXiv Detail & Related papers (2020-11-13T04:31:51Z)