DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition
- URL: http://arxiv.org/abs/2511.11000v2
- Date: Mon, 17 Nov 2025 02:49:01 GMT
- Title: DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition
- Authors: HongYu Liu, Junxin Li, Changxi Guo, Hao Chen, Yaqian Huang, Yifu Guo, Huan Yang, Lihua Cai,
- Abstract summary: DialogGraph-LLM is an end-to-end framework for recognizing speaker intent in audio dialogues.<n>It combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models for direct acoustic-to-intent inference.<n>The framework demonstrates strong performance and efficiency in intent recognition in real world scenario audio dialogues.
- Score: 10.94195981338177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing speaker intent in long audio dialogues among speakers has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM's superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real world scenario audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.
Related papers
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture.<n>Covo-Audio-Chat, a dialogue-oriented variant, demonstrates semantic strong spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z) - AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
multimodal audio-language models (ALMs) take interleaved audio and text as input and output text.<n>AHELM is a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench.<n>We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z) - DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations [62.00227663434538]
DRVOICE-7B establishes new state-of-the-art (SOTA) on OpenAudioBench and Big Bench Audio benchmarks.<n>This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling.
arXiv Detail & Related papers (2025-06-11T02:57:22Z) - Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation.<n>This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks.<n> Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z) - SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation [10.828717295018123]
We propose a unified embedding framework that eliminates the need for intermediate text representations.<n>Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods.
arXiv Detail & Related papers (2025-01-26T15:04:02Z) - From Intents to Conversations: Generating Intent-Driven Dialogues with Contrastive Learning for Multi-Turn Classification [21.6988262735281]
Chain-of-Intent is a novel framework that integrates Hidden Markov Models with Large Language Models to generate intent-driven, context-aware dialogues.<n> MINT-CL is a contrastive learning framework for multi-turn intent classification, which improves performance while reducing dependence on large-scale annotated datasets.
arXiv Detail & Related papers (2024-11-21T15:59:29Z) - Unsupervised End-to-End Task-Oriented Dialogue with LLMs: The Power of the Noisy Channel [9.082443585886127]
Training task-oriented dialogue systems typically require turn-level annotations for interacting with their APIs.
Unlabeled data and a schema definition are sufficient for building a working task-oriented dialogue system, completely unsupervised.
We propose an innovative approach using expectation-maximization (EM) that infers turn-level annotations as latent variables.
arXiv Detail & Related papers (2024-04-23T16:51:26Z) - One model to rule them all ? Towards End-to-End Joint Speaker
Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.