Related papers: CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework

CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework

URL: http://arxiv.org/abs/2509.08438v1
Date: Wed, 10 Sep 2025 09:35:43 GMT
Title: CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework
Authors: Jinzhong Ning, Paerhati Tulajiang, Yingying Le, Yijia Zhang, Yuanyuan Sun, Hongfei Lin, Haifeng Liu,
Abstract summary: Speech Relation Extraction (SpeechRE) aims to extract relation triplets directly from speech.<n>Existing benchmark datasets rely heavily on synthetic data, lacking sufficient quantity and diversity of real human speech.<n>We introduce CommonVoice-SpeechRE, a large-scale dataset comprising nearly 20,000 real-human speech samples from diverse speakers.
Score: 21.853908675421504
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speech Relation Extraction (SpeechRE) aims to extract relation triplets directly from speech. However, existing benchmark datasets rely heavily on synthetic data, lacking sufficient quantity and diversity of real human speech. Moreover, existing models also suffer from rigid single-order generation templates and weak semantic alignment, substantially limiting their performance. To address these challenges, we introduce CommonVoice-SpeechRE, a large-scale dataset comprising nearly 20,000 real-human speech samples from diverse speakers, establishing a new benchmark for SpeechRE research. Furthermore, we propose the Relation Prompt-Guided Multi-Order Generative Ensemble (RPG-MoGe), a novel framework that features: (1) a multi-order triplet generation ensemble strategy, leveraging data diversity through diverse element orders during both training and inference, and (2) CNN-based latent relation prediction heads that generate explicit relation prompts to guide cross-modal alignment and accurate triplet generation. Experiments show our approach outperforms state-of-the-art methods, providing both a benchmark dataset and an effective solution for real-world SpeechRE. The source code and dataset are publicly available at https://github.com/NingJinzhong/SpeechRE_RPG_MoGe.

Related papers

OmniRet: Efficient and High-Fidelity Omni Modality Retrieval [51.80205678389465]
We present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio.<n>Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others.
arXiv Detail & Related papers (2026-03-02T17:19:55Z)
AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
multimodal audio-language models (ALMs) take interleaved audio and text as input and output text.<n>AHELM is a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench.<n>We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
CASPER: A Large Scale Spontaneous Speech Dataset [25.446606381490025]
This paper introduces our dataset and methodology, laying the groundwork for addressing the shortage of spontaneous speech data.<n>We plan to expand this dataset in future stages, offering a growing resource for the research community.
arXiv Detail & Related papers (2025-05-30T22:03:59Z)
Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities.<n>Our pipeline consists of three main components: First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures.<n>Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL.<n>Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios.
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction [110.38946048535033]
This paper introduces Step-Audio, the first production-ready open-source solution for speech recognition.<n>Key contributions include: 1) a unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex
arXiv Detail & Related papers (2025-02-17T15:58:56Z)
SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation [10.828717295018123]
We propose a unified embedding framework that eliminates the need for intermediate text representations.<n>Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods.
arXiv Detail & Related papers (2025-01-26T15:04:02Z)
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks [112.6716697906318]
We present Dynamic-SUPERB Phase-2, an open benchmark for the comprehensive evaluation of instruction-based universal speech models.<n>Building upon the first generation, this second version incorporates 125 new tasks, expanding the benchmark to a total of 180 tasks.<n> Evaluation results show that no model performed well universally.
arXiv Detail & Related papers (2024-11-08T06:33:22Z)
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the large language models (LLMs) backbone.<n>Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition [48.527630771422935]
We propose a synthetic data generation pipeline for multi-speaker conversational ASR. We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings.
arXiv Detail & Related papers (2024-08-17T14:47:05Z)
Learning Disentangled Speech Representations [0.412484724941528]
SynSpeech is a novel large-scale synthetic speech dataset designed to enable research on disentangled speech representations.<n>We present a framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics.<n>We find that SynSpeech facilitates benchmarking across a range of factors, achieving promising disentanglement of simpler features like gender and speaking style, while highlighting challenges in isolating complex attributes like speaker identity.
arXiv Detail & Related papers (2023-11-04T04:54:17Z)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions. Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs. We employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem. We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.