Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects
- URL: http://arxiv.org/abs/2601.07274v1
- Date: Mon, 12 Jan 2026 07:30:51 GMT
- Title: Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects
- Authors: Kalvin Chang, Yiwen Shao, Jiahong Li, Dong Yu
- Abstract summary: Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. We achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data. Our benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at https://github.com/kalvinchang/yubao.
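The speech-to-speech retrieval evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's confirmed procedure: it assumes frame-level embeddings have already been extracted by the encoder, and the mean-pooling, L2 normalization, and recall@1 metric are common choices for this kind of cross-lingual retrieval, introduced here for exposition.

```python
import numpy as np

def pool_and_normalize(frame_embeddings):
    """Mean-pool a (T, D) array of frame-level embeddings into a
    unit-length (D,) utterance embedding."""
    utt = frame_embeddings.mean(axis=0)
    return utt / np.linalg.norm(utt)

def retrieval_recall_at_1(dialect_utts, mandarin_utts):
    """For each dialect utterance, retrieve the nearest Mandarin
    utterance by cosine similarity. Utterance i in the dialect list
    is assumed to be the translation of utterance i in the Mandarin
    list; returns the fraction retrieved correctly (recall@1)."""
    queries = np.stack([pool_and_normalize(u) for u in dialect_utts])
    keys = np.stack([pool_and_normalize(u) for u in mandarin_utts])
    sims = queries @ keys.T  # cosine similarity (rows are unit vectors)
    predictions = sims.argmax(axis=1)
    return float((predictions == np.arange(len(dialect_utts))).mean())
```

If the encoder has learned cross-dialect semantic alignment, a dialect utterance's embedding should sit closer to its Mandarin counterpart than to any other utterance, driving recall@1 toward 1.0.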
Related papers
- What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study [58.55905182336196]
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. We investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens.
arXiv Detail & Related papers (2025-06-14T15:26:31Z)
- FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation [10.73307957038715]
FMSD-TTS is a few-shot, multi-speaker, multi-dialect text-to-speech framework. It synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels.
arXiv Detail & Related papers (2025-05-20T13:35:55Z)
- Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation [3.9166923630129604]
Bailing-TTS is a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech.
The Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes.
Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech towards human-like spontaneous representation.
arXiv Detail & Related papers (2024-08-01T04:57:31Z)
- Cross-Lingual Transfer Learning for Speech Translation [7.802021866251242]
This paper examines how to expand the speech translation capability of speech foundation models with restricted data. Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model. Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space.
arXiv Detail & Related papers (2024-07-01T09:51:48Z)
- SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models [58.996653700982556]
Existing speech tokens are not specifically designed for speech language modeling.
We propose SpeechTokenizer, a unified speech tokenizer for speech large language models.
Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark.
arXiv Detail & Related papers (2023-08-31T12:53:09Z)
- PolyVoice: Language Models for Speech to Speech Translation [50.31000706309143]
PolyVoice is a language model-based framework for speech-to-speech translation (S2ST).
We use discretized speech units, which are generated in a fully unsupervised way.
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
arXiv Detail & Related papers (2023-06-05T15:53:15Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Pronunciation Modeling of Foreign Words for Mandarin ASR by Considering the Effect of Language Transfer [4.675953329876724]
The paper focuses on examining the phonetic effect of language transfer in automatic speech recognition.
A set of lexical rules is proposed to convert an English word into Mandarin phonetic representation.
The proposed lexical rules generalize well and can be applied directly to unseen English words.
arXiv Detail & Related papers (2022-10-07T14:59:44Z)
- A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation [6.090922774386845]
We propose a novel Chinese dialect TTS frontend with a translation module.
It helps to convert Mandarin text into idiomatic expressions with correct orthography and grammar.
It is the first known work to incorporate translation with TTS.
arXiv Detail & Related papers (2022-06-10T07:46:34Z)
- Mandarin-English Code-switching Speech Recognition with Self-supervised Speech Representation Models [55.82292352607321]
Code-switching (CS) is common in daily conversations where more than one language is used within a sentence.
This paper uses the recently successful self-supervised learning (SSL) methods to leverage large amounts of unlabeled speech data without CS.
arXiv Detail & Related papers (2021-10-07T14:43:35Z)