S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models
- URL: http://arxiv.org/abs/2505.14438v1
- Date: Tue, 20 May 2025 14:42:20 GMT
- Title: S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models
- Authors: Yuanbo Fang, Haoze Sun, Jun Liu, Tao Zhang, Zenan Zhou, Weipeng Chen, Xiaofen Xing, Xiangmin Xu
- Abstract summary: End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. This often leads to a decline in reasoning and generation performance compared to text input. We propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs.
- Score: 14.060679420379516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at https://github.com/undobug/S2SBench.
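The pairwise protocol described in the abstract is straightforward to sketch. Below is a minimal, text-side illustration of the idea, assuming a Hugging Face causal language model; the model name, data, and helper functions are illustrative placeholders rather than the authors' actual setup, for which the linked repository is authoritative:

```python
# Minimal sketch (not the official S2SBench code): score a model by the
# fraction of (plausible, implausible) pairs in which the plausible
# sample receives lower perplexity. Model and data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; S2SBench evaluates speech LLMs on audio tokens
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Mean token-level perplexity of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(input_ids=ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

def pairwise_accuracy(pairs) -> float:
    """Fraction of pairs where the plausible sample is less perplexing."""
    hits = sum(perplexity(good) < perplexity(bad) for good, bad in pairs)
    return hits / len(pairs)

# Hypothetical continuation pair in the style of a commonsense probe.
pairs = [("She put the milk back in the fridge.",
          "She put the milk back in the volcano.")]
print(f"pairwise accuracy: {pairwise_accuracy(pairs):.2f}")
```

Running the same comparison once with audio-token input and once with text input would yield the relative degradation measure the abstract describes.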
Related papers
- Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model.
The choice of speech-text joint decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
- SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation [14.57248739077317]
This paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference.
SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech using a pre-trained vocoder.
arXiv Detail & Related papers (2025-04-22T01:05:32Z)
- Audio Large Language Models Can Be Descriptive Speech Quality Evaluators [46.765203628127345]
We introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings.
This corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation.
We propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech.
arXiv Detail & Related papers (2025-01-27T22:47:51Z)
- V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow [57.51550409392103]
We introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos.
To address the challenges of this task, we decompose the speech signal into manageable subspaces, each representing distinct speech attributes, and predict them directly from the visual input.
To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture.
arXiv Detail & Related papers (2024-11-29T05:55:20Z)
- A Holistic Cascade System, benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation [45.47457657122893]
Expressive speech-to-speech translation (S2ST) aims to transfer prosodic attributes of source speech to target speech while maintaining translation accuracy.
Existing research in expressive S2ST is limited, typically focusing on a single expressivity aspect at a time.
We propose a holistic cascade system for expressive S2ST, combining multiple prosody transfer techniques previously considered only in isolation.
arXiv Detail & Related papers (2023-01-25T14:27:00Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting a spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
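As a rough illustration of the switching idea (not the paper's implementation), the sketch below swaps quantized targets between clean and noisy views inside an InfoNCE-style contrastive loss; all tensors are random placeholders standing in for wav2vec 2.0's contextual and quantized representations:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, negatives, temperature=0.1):
    """InfoNCE-style loss: each context vector should be most similar
    to its (quantized) target among the sampled negatives."""
    # context, targets: (T, D); negatives: (T, K, D)
    pos = F.cosine_similarity(context, targets, dim=-1)                # (T,)
    neg = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1) # (T, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature   # (T, 1+K)
    labels = torch.zeros(logits.size(0), dtype=torch.long)             # positive at index 0
    return F.cross_entropy(logits, labels)

# Schematic wav2vec-Switch step: run the same frames in clean and noisy
# form, then *swap* the quantized targets so each view must predict the
# other's quantization, encouraging noise-invariant representations.
T, D, K = 50, 256, 10
c_clean, c_noisy = torch.randn(T, D), torch.randn(T, D)  # contextual reps
q_clean, q_noisy = torch.randn(T, D), torch.randn(T, D)  # quantized reps
negs = torch.randn(T, K, D)                              # sampled distractors

loss_standard = contrastive_loss(c_clean, q_clean, negs) + contrastive_loss(c_noisy, q_noisy, negs)
loss_switched = contrastive_loss(c_clean, q_noisy, negs) + contrastive_loss(c_noisy, q_clean, negs)
loss = loss_standard + loss_switched
```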
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.