Discrete Audio Tokens: More Than a Survey!
- URL: http://arxiv.org/abs/2506.10274v2
- Date: Mon, 16 Jun 2025 18:38:29 GMT
- Title: Discrete Audio Tokens: More Than a Survey!
- Authors: Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli
- Abstract summary: This paper presents a systematic review and benchmark of discrete audio tokenizers. It covers speech, music, and general audio domains. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains.
- Score: 107.69720675124255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.
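The quantization step at the heart of the taxonomy above can be illustrated with a minimal sketch of residual vector quantization (RVQ), the scheme used by neural-codec-style tokenizers to map a continuous frame embedding onto a stack of discrete token indices. This is a toy illustration, not any specific tokenizer from the survey: the codebooks are random placeholders, whereas real tokenizers learn them end to end.

```python
import numpy as np

# Toy residual vector quantization: each level quantizes the residual
# left over by the previous level, yielding one token index per level.
rng = np.random.default_rng(0)
DIM, CODEBOOK_SIZE, NUM_QUANTIZERS = 8, 16, 3

# Placeholder codebooks; a real codec tokenizer learns these.
codebooks = rng.normal(size=(NUM_QUANTIZERS, CODEBOOK_SIZE, DIM))

def rvq_encode(x, codebooks):
    """Quantize one embedding into one token index per codebook level."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        # Pick the codeword nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]  # quantize what is left over
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct the embedding by summing the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

x = rng.normal(size=DIM)        # stands in for one encoder frame embedding
tokens = rvq_encode(x, codebooks)
x_hat = rvq_decode(tokens, codebooks)
print(tokens)                   # one discrete token per quantizer level
```

With N levels and codebook size V, each frame is represented by N integers in [0, V), which is what makes such tokens directly usable as an LLM vocabulary.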
Related papers
- Speech Tokenizer is Key to Consistent Representation [0.0]
Speech tokenization is crucial in digital speech processing, converting continuous speech signals into discrete units for various computational tasks. We propose an advanced approach that simultaneously encodes both linguistic and acoustic information, preserving prosodic and emotional content. Empirical evaluations demonstrate its effectiveness in speech coding, voice conversion, emotion recognition, and multimodal language modeling, without requiring additional training.
arXiv Detail & Related papers (2025-07-09T12:43:39Z)
- Kimi-Audio Technical Report [67.69331679172303]
Kimi-Audio is an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation.
arXiv Detail & Related papers (2025-04-25T15:31:46Z)
- Summarizing Speech: A Comprehensive Survey [76.13011304983458]
Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. This survey examines existing datasets and evaluation protocols, which are crucial for assessing the quality of summarization approaches.
arXiv Detail & Related papers (2025-04-10T17:50:53Z)
- Recent Advances in Discrete Speech Tokens: A Review [25.038527125935747]
Discrete speech tokens, characterized by their discrete, compact, and concise nature, are advantageous for efficient transmission and storage. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types.
arXiv Detail & Related papers (2025-02-10T14:08:25Z)
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
- DASB -- Discrete Audio and Speech Benchmark [12.02056212008393]
We release the Discrete Audio and Speech Benchmark (DASB), a leaderboard for benchmarking discrete audio tokens across a range of tasks.
Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks.
However, the performance gap between semantic tokens and standard continuous representations remains substantial.
arXiv Detail & Related papers (2024-06-20T13:23:27Z)
- How Should We Extract Discrete Audio Tokens from Self-Supervised Models? [15.03039528965825]
This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks.
We propose a scalable solution to train a universal vocoder across multiple SSL layers.
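The "semantic token" recipe referenced above is commonly implemented by clustering frame-level features from a self-supervised model and using the cluster IDs as discrete tokens. The sketch below shows that idea with plain k-means (Lloyd's algorithm); the random vectors stand in for real SSL features (e.g. a HuBERT layer), and a real pipeline would use a library clustering implementation and learned features.

```python
import numpy as np

# Toy "semantic token" extraction: k-means over SSL frame features,
# with cluster indices serving as the discrete token sequence.
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 16))   # stand-in: 200 frames of 16-dim SSL features
K = 8                                   # codebook size (number of clusters)

# Lloyd's algorithm: alternate nearest-centroid assignment and mean update.
centroids = features[rng.choice(len(features), K, replace=False)]
for _ in range(10):
    # Distance from every frame to every centroid: shape (200, K).
    d = np.linalg.norm(features[:, None] - centroids[None], axis=2)
    assign = d.argmin(axis=1)
    for k in range(K):
        if (assign == k).any():
            centroids[k] = features[assign == k].mean(axis=0)

tokens = assign  # one discrete token per frame
print(tokens[:10])
```

Which SSL layer the features come from, and how large K is, are exactly the configuration choices the paper above studies across discriminative and generative tasks.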
arXiv Detail & Related papers (2024-06-15T20:43:07Z)
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [98.34889301515412]
We develop the Qwen-Audio model and address the limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types.
Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning.
We further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.
arXiv Detail & Related papers (2023-11-14T05:34:50Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)