UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets
- URL: http://arxiv.org/abs/2507.12951v1
- Date: Thu, 17 Jul 2025 09:45:49 GMT
- Title: UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets
- Authors: Zhichao Sheng, Shilin Zhou, Chen Gong, Zhenghua Li
- Abstract summary: Spoken Language Understanding (SLU) plays a crucial role in speech-centric multimedia applications. We propose UniSLU, a unified framework that jointly models multiple SLU tasks within a single architecture.
- Score: 21.47194295019577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken Language Understanding (SLU) plays a crucial role in speech-centric multimedia applications, enabling machines to comprehend spoken language in scenarios such as meetings, interviews, and customer service interactions. SLU encompasses multiple tasks, including Automatic Speech Recognition (ASR), spoken Named Entity Recognition (NER), and spoken Sentiment Analysis (SA). However, existing methods often rely on separate model architectures for individual tasks such as spoken NER and SA, which increases system complexity, limits cross-task interaction, and fails to fully exploit heterogeneous datasets available across tasks. To address these limitations, we propose UniSLU, a unified framework that jointly models multiple SLU tasks within a single architecture. Specifically, we propose a unified representation for diverse SLU tasks, enabling full utilization of heterogeneous datasets across multiple tasks. Built upon this representation, we propose a unified generative method that jointly models ASR, spoken NER, and SA tasks, enhancing task interactions and enabling seamless integration with large language models to harness their powerful generative capabilities. Extensive experiments on public SLU datasets demonstrate the effectiveness of our approach, achieving superior SLU performance compared to several benchmark methods, making it well-suited for real-world speech-based multimedia scenarios. We will release all code and models on GitHub to facilitate future research.
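The abstract describes, but does not spell out, the unified representation that lets ASR transcripts, spoken NER spans, and sentiment labels share a single generative target. As a minimal sketch of that general idea (the tag format, label names, and helper function below are illustrative assumptions, not UniSLU's published scheme), heterogeneous samples could be serialized like this:

```python
# Illustrative sketch only: the tag format and labels are assumptions,
# not the serialization actually used by UniSLU.

def build_target_sequence(transcript: str,
                          entities: list[tuple[int, int, str]],
                          sentiment: str | None) -> str:
    """Serialize an ASR transcript, optional NER spans, and an optional
    sentiment label into one text target, so a single generative model can
    be trained on heterogeneous datasets where some annotations are missing."""
    text = transcript
    # Insert entity markers right-to-left so earlier character offsets stay valid.
    for start, end, label in sorted(entities, key=lambda e: e[0], reverse=True):
        text = text[:start] + f"[{label}] " + text[start:end] + f" [/{label}]" + text[end:]
    if sentiment is not None:
        text += f" <sentiment> {sentiment}"
    return text

# A sample annotated for spoken NER only (no sentiment label available):
print(build_target_sequence("book a table at nobu tomorrow",
                            entities=[(16, 20, "ORG")],
                            sentiment=None))
# -> "book a table at [ORG] nobu [/ORG] tomorrow"
```

Serializing every task into plain text in this way is what allows ASR-only, NER-only, and SA-only corpora to be mixed in one training set and decoded by a single seq2seq or LLM decoder.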
Related papers
- UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z) - CroPrompt: Cross-task Interactive Prompting for Zero-shot Spoken Language Understanding [40.75828713474074]
We present Cross-task Interactive Prompting (CroPrompt) for spoken language understanding (SLU).
CroPrompt enables the model to interactively leverage the information exchange across the correlated tasks in SLU.
We also introduce a multi-task self-consistency mechanism to mitigate the error propagation caused by the intent information injection.
arXiv Detail & Related papers (2024-06-15T04:54:56Z) - Synergetic Event Understanding: A Collaborative Approach to Cross-Document Event Coreference Resolution with Large Language Models [41.524192769406945]
Cross-document event coreference resolution (CDECR) involves clustering event mentions across multiple documents that refer to the same real-world events.
Existing approaches utilize fine-tuning of small language models (SLMs) to address the compatibility among the contexts of event mentions.
We propose a collaborative approach for CDECR, leveraging the capabilities of both a universally capable LLM and a task-specific SLM.
arXiv Detail & Related papers (2024-06-04T09:35:47Z) - Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning [50.1035273069458]
Spoken language understanding (SLU) is a core task in task-oriented dialogue systems.
We propose a multi-level MMCL framework to apply contrastive learning at three levels, including utterance level, slot level, and word level.
Our framework achieves new state-of-the-art results on two public multi-intent SLU datasets.
arXiv Detail & Related papers (2024-05-31T14:34:23Z) - Do Large Language Model Understand Multi-Intent Spoken Language ? [5.494472119991781]
This research signifies a considerable breakthrough in leveraging Large Language Models (LLMs) for multi-intent spoken language understanding (SLU).
Our approach re-imagines the use of entity slots in multi-intent SLU applications.
We introduce the concept of Sub-Intent Instruction (SII) to amplify the analysis and interpretation of complex, multi-intent communications.
arXiv Detail & Related papers (2024-03-07T13:30:52Z) - LLMBind: A Unified Modality-Task Integration Framework [38.95771765322677]
We introduce LLMBind, a novel framework designed to unify a diverse array of multi-modal tasks.
By harnessing a Mixture-of-Experts (MoE) Large Language Model (LLM), LLMBind processes multi-modal inputs and generates task-specific tokens, enabling the invocation of corresponding models to accomplish tasks.
arXiv Detail & Related papers (2024-02-22T12:36:31Z) - UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions [64.50935101415776]
We build a single model that jointly performs various spoken language understanding (SLU) tasks.
We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages.
arXiv Detail & Related papers (2023-10-04T17:10:23Z) - Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset in terms of multi-turn dialogue, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z) - SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z) - MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue [115.32009638844059]
We extend the English-only NLU++ dataset to include manual translations into a range of high-, medium-, and low-resource languages.
Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals.
We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
arXiv Detail & Related papers (2022-12-20T17:34:25Z) - SUPERB: Speech processing Universal PERformance Benchmark [78.41287216481203]
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV).
SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks.
We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model.
arXiv Detail & Related papers (2021-05-03T17:51:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.