Related papers: Roadmap towards Superhuman Speech Understanding using Large Language Models

Roadmap towards Superhuman Speech Understanding using Large Language Models

URL: http://arxiv.org/abs/2410.13268v1
Date: Thu, 17 Oct 2024 06:44:06 GMT
Title: Roadmap towards Superhuman Speech Understanding using Large Language Models
Authors: Fan Bu, Yuhao Zhang, Xidong Wang, Benyou Wang, Qun Liu, Haizhou Li,
Abstract summary: Large language models (LLMs) integrate speech and audio data. Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs. We propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models.
Score: 60.57947401837938
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The success of large language models (LLMs) has prompted efforts to integrate speech and audio data, aiming to create general foundation models capable of processing both textual and non-textual inputs. Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs, which preserves non-semantic information and world knowledge for deeper speech understanding. To guide the development of speech LLMs, we propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models capable of integrating non-semantic information with abstract acoustic knowledge for complex tasks. Moreover, we design a benchmark, SAGI Bechmark, that standardizes critical aspects across various tasks in these five levels, uncovering challenges in using abstract acoustic knowledge and completeness of capability. Our findings reveal gaps in handling paralinguistic cues and abstract acoustic knowledge, and we offer future directions. This paper outlines a roadmap for advancing speech LLMs, introduces a benchmark for evaluation, and provides key insights into their current limitations and potential.

Related papers

SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling [10.898666440393896]
Slot filling is a crucial subtask in spoken language understanding (SLU)<n>Recent advent of speech-based large language models (speechLLMs) has opened new avenues for achieving speech understanding tasks.<n>We propose improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result.
arXiv Detail & Related papers (2025-10-17T17:45:29Z)
ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody.<n>We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
arXiv Detail & Related papers (2025-07-27T00:59:01Z)
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia [18.661797258760974]
We present OSUM, an Open Speech Understanding Model designed to explore the potential of training SLUMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC)
arXiv Detail & Related papers (2025-01-23T01:27:46Z)
Recent Advances in Speech Language Models: A Survey [45.968078636811356]
Speech Language Models (SpeechLMs) are end-to-end models that generate speech without converting from text. This paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs.
arXiv Detail & Related papers (2024-10-01T21:48:12Z)
Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. Our approach utilizes WavLM and Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are first to explore the potential of prompting speech LMs in the domain of speech processing. We reformulate speech processing tasks into speech-to-unit generation tasks. We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities. Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks. These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive rich, speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
Instruction-Following Speech Recognition [21.591086644665197]
We introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring Large Language Models or pre-trained speech modules.
arXiv Detail & Related papers (2023-09-18T14:59:10Z)
TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs. We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. We demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z)
Self-supervised language learning from raw audio: Lessons from the Zero Resource Speech Challenge [15.67794428589585]
Self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio. The Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defined tasks. We present an overview of the six editions of this challenge series since 2015, discuss the lessons learned, and outline the areas which need more work or give puzzling results.
arXiv Detail & Related papers (2022-10-27T20:32:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.