AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation
- URL: http://arxiv.org/abs/2509.02349v2
- Date: Thu, 04 Sep 2025 14:25:57 GMT
- Title: AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation
- Authors: Lu Wang, Hao Chen, Siyu Wu, Zhiyue Wu, Hao Zhou, Chengfeng Zhang, Ting Wang, Haodi Zhang,
- Abstract summary: Multimodal Large Language Models (MLLMs) have been widely applied in speech and music. Unlike semantic-only text tokens, audio tokens must both capture global semantic content and preserve fine-grained acoustic details. This paper provides suitable definitions for semantic and acoustic tokens and introduces a systematic evaluation framework.
- Score: 16.047087043580053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have been widely applied to speech and music. This trend has led to a focus on audio tokenization for Large Models (LMs). Unlike semantic-only text tokens, audio tokens must both capture global semantic content and preserve fine-grained acoustic details. Moreover, they provide a discrete representation of speech and music that can be effectively integrated into MLLMs. However, existing research defines semantic tokens and acoustic tokens inconsistently. In addition, evaluations of different codecs typically concentrate on specific domains or tasks, such as reconstruction or Automatic Speech Recognition (ASR), which prevents fair and comprehensive comparisons. To address these problems, this paper provides suitable definitions for semantic and acoustic tokens and introduces a systematic evaluation framework. The framework enables a comprehensive assessment of codecs' capabilities across four dimensions: audio reconstruction metrics, codebook index (ID) stability, decoder-only transformer perplexity, and performance on downstream probe tasks. Our results support the proposed definitions and show the correlation among reconstruction metrics, codebook ID stability, downstream probe tasks, and perplexity.
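The codebook-ID-stability dimension mentioned in the abstract can be illustrated with a minimal sketch: quantize the same feature frames twice, once with a small input perturbation, and measure the fraction of token IDs that survive. The `quantize` and `id_stability` helpers below are hypothetical toy illustrations over a random nearest-neighbor codebook, not the paper's actual codecs or metric definition.

```python
import numpy as np

def quantize(frames, codebook):
    # Nearest-neighbor vector quantization: map each frame to a codebook index.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def id_stability(frames, codebook, noise_std=0.01, seed=0):
    # Fraction of token IDs left unchanged by a small Gaussian perturbation
    # of the input features (1.0 = perfectly stable codebook assignments).
    rng = np.random.default_rng(seed)
    ids_clean = quantize(frames, codebook)
    ids_noisy = quantize(frames + rng.normal(0.0, noise_std, frames.shape), codebook)
    return float((ids_clean == ids_noisy).mean())

rng = np.random.default_rng(42)
codebook = rng.normal(size=(256, 16))   # 256 codes, 16-dim embeddings (toy)
frames = rng.normal(size=(1000, 16))    # 1000 feature frames (toy)
print(f"ID stability: {id_stability(frames, codebook):.3f}")
```

A real evaluation would perturb the waveform (or re-encode time-shifted audio) and run a trained codec's encoder; the toy version only shows the shape of the measurement.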
Related papers
- STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs [19.07983030478734]
STACodec integrates semantic information from self-supervised learning (SSL) models into the first layer of residual vector quantization (RVQ-1). We propose a semantic pre-distillation (SPD) module, which predicts semantic tokens directly for assignment to the first RVQ layer during inference.
arXiv Detail & Related papers (2026-02-05T20:36:24Z) - Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation [79.13636675697096]
Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS) is a task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations. We propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information.
arXiv Detail & Related papers (2026-02-03T07:47:59Z) - AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including two new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z) - SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing methods face several challenges in semantic encoding. We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec.
arXiv Detail & Related papers (2025-08-04T19:22:14Z) - Discrete Audio Tokens: More Than a Survey! [107.69720675124255]
This paper presents a systematic review and benchmark of discrete audio tokenizers. It covers speech, music, and general audio domains. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains.
arXiv Detail & Related papers (2025-06-12T01:35:43Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - DASB -- Discrete Audio and Speech Benchmark [12.02056212008393]
We release the Discrete Audio and Speech Benchmark (DASB), a leaderboard for benchmarking discrete audio tokens across a range of tasks.
Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks.
However, the performance gap between semantic tokens and standard continuous representations remains substantial.
arXiv Detail & Related papers (2024-06-20T13:23:27Z) - How Should We Extract Discrete Audio Tokens from Self-Supervised Models? [15.03039528965825]
This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks.
We propose a scalable solution to train a universal vocoder across multiple SSL layers.
arXiv Detail & Related papers (2024-06-15T20:43:07Z) - Introducing Semantics into Speech Encoders [91.37001512418111]
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves similar performance as supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.