Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation
- URL: http://arxiv.org/abs/2505.13338v2
- Date: Tue, 03 Jun 2025 04:11:48 GMT
- Title: Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation
- Authors: Qiongqiong Wang, Hardik B. Sailor, Tianchi Liu, Ai Ti Aw
- Abstract summary: Current speech-LLMs exhibit limited capability in contextual reasoning alongside paralinguistic understanding. We propose a novel framework for dataset generation from in-the-wild speech data that integrates contextual reasoning with paralinguistic information.
- Score: 15.242292693077417
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Current speech-LLMs exhibit limited capability in contextual reasoning alongside paralinguistic understanding, primarily due to the lack of Question-Answer (QA) datasets that cover both aspects. We propose a novel framework for dataset generation from in-the-wild speech data that integrates contextual reasoning with paralinguistic information. It consists of pseudo paralinguistic label-based data condensation of in-the-wild speech and LLM-based Contextual Paralinguistic QA (CPQA) generation. Its effectiveness is validated by a strong correlation between evaluations of the Qwen2-Audio-7B-Instruct model on a dataset created by our framework and on a human-generated CPQA dataset. The results also reveal the speech-LLM's limitations in handling empathetic reasoning tasks, highlighting the need for such datasets and for more robust models. The proposed framework is the first of its kind and has potential for training more robust speech-LLMs with paralinguistic reasoning capabilities.
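As an illustration of the two-stage pipeline described in the abstract, here is a minimal Python sketch of pseudo-label-based condensation followed by LLM-prompted CPQA generation. The label set, confidence threshold, prompt wording, and helper names are assumptions for illustration, not the authors' exact recipe.

```python
# Minimal sketch of the two-stage pipeline: pseudo paralinguistic labeling
# + condensation, then LLM-prompted CPQA generation. Illustrative only.
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str
    transcript: str
    emotion: str = ""       # pseudo paralinguistic label
    confidence: float = 0.0

def pseudo_label(utt: Utterance) -> Utterance:
    # Stage 1a: tag each in-the-wild utterance with a pseudo paralinguistic
    # label from an off-the-shelf classifier (placeholder logic here).
    utt.emotion, utt.confidence = "happy", 0.92  # stand-in for a real model
    return utt

def condense(utts, min_conf=0.8):
    # Stage 1b: data condensation -- keep only utterances whose pseudo
    # labels are confident enough to anchor contextual-paralinguistic QA.
    return [u for u in utts if u.confidence >= min_conf]

def cpqa_prompt(utt: Utterance) -> str:
    # Stage 2: prompt an LLM to write QA pairs that require BOTH the
    # spoken content and the paralinguistic label to answer.
    return (
        "Transcript: {t}\nSpeaker emotion: {e}\n"
        "Write a question-answer pair that can only be answered by "
        "combining what was said with how it was said."
    ).format(t=utt.transcript, e=utt.emotion)

pool = [pseudo_label(Utterance("clip_001.wav", "I can't believe we won!"))]
for u in condense(pool):
    print(cpqa_prompt(u))  # send to an LLM to obtain CPQA pairs
```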
Related papers
- Unlocking Speech Instruction Data Potential with Query Rewriting [26.134056897363557]
End-to-end Large Speech Language Models (LSLMs) demonstrate strong potential in response latency and speech comprehension capabilities. However, their ability to follow speech instructions has not been fully realized due to a lack of datasets and heavily biased training tasks. We propose a query rewriting framework with multi-LLM knowledge fusion, employing multiple agents to annotate and validate the synthesized speech.
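A minimal sketch of the rewrite-then-validate idea, assuming a TTS-to-ASR round-trip check; the agent behaviors and the accept rule are illustrative stand-ins, not the paper's exact method.

```python
# Illustrative sketch of query rewriting with multi-agent validation.
def rewrite_agents(text):
    # Each "agent" would be a different LLM; here, simple stand-ins that
    # make written queries more speakable (expand symbols, digits, etc.).
    return [text.replace("&", "and"), text.replace("2", "two")]

def validate(candidate, asr_transcript):
    # Accept a rewrite only if a TTS->ASR round-trip preserves it closely.
    return candidate.lower() == asr_transcript.lower()

query = "Compare model A & model B on 2 benchmarks"
for cand in rewrite_agents(query):
    # In the real pipeline: speech = tts(cand); transcript = asr(speech)
    transcript = cand  # placeholder round-trip
    if validate(cand, transcript):
        print("kept:", cand)
```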
arXiv Detail & Related papers (2025-07-11T13:55:45Z) - Scaling Speech-Text Pre-training with Synthetic Interleaved Data [31.77653849518526]
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction. Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora.
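A toy sketch of how interleaved speech-text sequences might be assembled, assuming discrete speech tokens derived from text spans; the span lengths, mixing probability, and `<s_k>` token naming are illustrative assumptions.

```python
# Toy construction of a synthetic interleaved speech-text sequence.
import random
random.seed(0)

def to_speech_tokens(words):
    # Stand-in for a text-to-speech-token model: each word becomes a few
    # discrete speech tokens like <s_17>.
    return [f"<s_{random.randrange(1024)}>" for _ in words for _ in range(2)]

def interleave(words, p_speech=0.4, max_span=3):
    out, i = [], 0
    while i < len(words):
        span = min(random.randint(1, max_span), len(words) - i)
        chunk = words[i:i + span]
        # Randomly emit each span either as text or as discrete speech tokens.
        out += to_speech_tokens(chunk) if random.random() < p_speech else chunk
        i += span
    return out

print(" ".join(interleave("the cat sat on the mat".split())))
```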
arXiv Detail & Related papers (2024-11-26T17:19:09Z) - VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the large language model (LLM) backbone. Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
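A minimal sketch of the LoRA mechanism such an approach builds on: a frozen backbone projection plus a low-rank trainable update. The rank and scaling values here are generic defaults, not the paper's settings.

```python
# Minimal LoRA adapter on a frozen linear layer (illustrative settings).
import torch, torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # backbone stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Frozen base projection plus low-rank trainable update.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```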
arXiv Detail & Related papers (2024-10-23T00:36:06Z) - Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
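A small sketch of masked prediction against pseudo word-level targets, the core idea summarized above; the tensor shapes and the frame-to-word alignment are assumptions for illustration.

```python
# Masked frame prediction against pseudo word-level targets (illustrative).
import torch, torch.nn.functional as F

T, V = 12, 500                      # frames, pseudo word vocabulary size
logits = torch.randn(T, V)          # encoder predictions per frame
# Frames inherit the pseudo word-level target of the word spanning them.
word_ids = torch.tensor([7, 7, 7, 7, 42, 42, 42, 42, 3, 3, 3, 3])
mask = torch.zeros(T, dtype=torch.bool)
mask[2:6] = True                    # masked span, as in HuBERT-style training
loss = F.cross_entropy(logits[mask], word_ids[mask])
print(loss.item())
```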
arXiv Detail & Related papers (2024-02-08T16:55:21Z) - Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
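An illustrative serialization of one dialogue turn in the spirit of the description above; the tag names and exact ordering are assumptions, not the paper's format.

```python
# Serialized multitask prompt: predict the current sentiment, then the
# response, within one sequence (tag names are illustrative).
history = [("A", "I finally got the job!", "positive")]
current_text = "That's wonderful news"

prompt = ""
for spk, text, sent in history:
    prompt += f"<ctx> {spk}: {text} <sent> {sent} "
prompt += f"<ctx> B: {current_text} <sent>"
# The LM is trained to continue with "positive <resp> ..." so that sentiment
# prediction conditions response generation within a single sequence.
print(prompt)
```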
arXiv Detail & Related papers (2023-12-23T18:14:56Z) - Learning Disentangled Speech Representations [0.412484724941528]
SynSpeech is a novel large-scale synthetic speech dataset designed to enable research on disentangled speech representations. We present a framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics. We find that SynSpeech facilitates benchmarking across a range of factors, achieving promising disentanglement of simpler features like gender and speaking style, while highlighting challenges in isolating complex attributes like speaker identity.
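A minimal linear-probing sketch of the kind of evaluation described: fit a linear classifier on frozen speech representations and read probe accuracy as evidence that a factor (e.g., gender) is encoded. The data here is synthetic stand-in data.

```python
# Linear probing of frozen representations (synthetic stand-in data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 64))          # stand-in for frozen embeddings
y = (X[:, 0] > 0).astype(int)           # stand-in factor label
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("probe accuracy:", probe.score(Xte, yte))
```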
arXiv Detail & Related papers (2023-11-04T04:54:17Z) - Enhancing Argument Structure Extraction with Efficient Leverage of Contextual Information [79.06082391992545]
We propose an Efficient Context-aware model (ECASE) that fully exploits contextual information.
We introduce a sequence-attention module and distance-weighted similarity loss to aggregate contextual information and argumentative information.
Our experiments on five datasets from various domains demonstrate that our model achieves state-of-the-art performance.
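A speculative sketch of what a distance-weighted similarity loss could look like: similarities between sentence representations weighted by a term that decays with their distance in the document. The exact ECASE formulation is not given here and likely differs.

```python
# Speculative distance-weighted similarity objective (not the paper's exact loss).
import torch, torch.nn.functional as F

h = F.normalize(torch.randn(6, 32), dim=-1)   # one embedding per sentence
idx = torch.arange(6)
w = 1.0 / (1 + (idx[:, None] - idx[None, :]).abs().float())  # distance weight
sim = h @ h.T                                  # cosine similarities
off = ~torch.eye(6, dtype=torch.bool)          # ignore self-similarity
loss = -(w[off] * sim[off]).mean()             # pull nearby sentences together
print(loss.item())
```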
arXiv Detail & Related papers (2023-10-08T08:47:10Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
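A hedged sketch of a contrastive diagnostic in the spirit of such a test suite: a model with discourse knowledge should score a coherent variant above a perturbed one. The scoring function below is a toy stand-in for model log-likelihood.

```python
# Contrastive discourse diagnostic (toy scoring stand-in).
def score(text):
    # Stand-in for model log-likelihood; rewards pronoun-antecedent agreement.
    return 1.0 if "she" in text else 0.0

coherent = "Mary entered the room. Then she sat down."
perturbed = "Mary entered the room. Then he sat down."
print("pass" if score(coherent) > score(perturbed) else "fail")
```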
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
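A sketch of the joint objective summarized above: one shared encoder optimized with a BERT-style masked text loss plus a w2v-BERT-style masked speech loss on each batch. The dimensions, vocabulary sizes, and equal loss weighting are assumptions.

```python
# One shared encoder, two self-supervised losses (illustrative settings).
import torch, torch.nn as nn, torch.nn.functional as F

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(64, 4, batch_first=True), 2)
text_head, speech_head = nn.Linear(64, 1000), nn.Linear(64, 320)

text_x, speech_x = torch.randn(2, 10, 64), torch.randn(2, 20, 64)
text_tgt = torch.randint(1000, (2, 10))     # masked text token ids
speech_tgt = torch.randint(320, (2, 20))    # quantized speech units

# BERT-style loss on text plus w2v-BERT-style loss on speech, same encoder.
loss = (F.cross_entropy(text_head(encoder(text_x)).flatten(0, 1),
                        text_tgt.flatten())
        + F.cross_entropy(speech_head(encoder(speech_x)).flatten(0, 1),
                          speech_tgt.flatten()))
loss.backward()
print(float(loss))
```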
arXiv Detail & Related papers (2021-10-20T00:59:36Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
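A hedged sketch of one alignment component such semi-supervised joint pre-training could use: tie the pooled speech and text representations of paired examples, alongside self-supervised losses on unpaired data. The mean pooling and cosine loss here are assumptions.

```python
# Aligning pooled speech and text representations on paired data (sketch).
import torch, torch.nn.functional as F

speech_repr = torch.randn(4, 20, 64, requires_grad=True)  # speech module output
text_repr = torch.randn(4, 12, 64, requires_grad=True)    # language module output

speech_vec = speech_repr.mean(dim=1)   # pooled utterance embedding
text_vec = text_repr.mean(dim=1)       # pooled sentence embedding
align_loss = 1 - F.cosine_similarity(speech_vec, text_vec).mean()
align_loss.backward()
print(float(align_loss))
```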
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.