Fugu-MT Paper Translation (Abstract): When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

Paper summary: When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

arxiv url: http://arxiv.org/abs/2603.10904v1
Date: Wed, 11 Mar 2026 15:48:11 GMT
Status: Translation complete
System update date: 2026-03-12 16:22:33.038119
Title: When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS
Title (reference translation): When Fine-Tuning Fails and When It Generalises: The Role of Data Diversity and Mixed Training in LLM-based TTS
Authors: Anupam Purwar, Aditya Choudhary
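Abstract summary: Fine-tuning the language model backbone of a TTS system shows promise for improving voice consistency and signal-to-noise ratio (SNR). Speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity. Speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR.
Reference score (independently computed attention score): 0.42970700836450487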
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models are increasingly adopted as semantic backbones for neural text-to-speech systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments involving fine-tuning of the language model backbone of TTS show promise in improving voice consistency and signal-to-noise ratio (SNR) in the voice cloning task. Across multiple speakers, LoRA fine-tuning consistently outperforms the non-fine-tuned base Qwen-0.5B model across three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to 0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity, indicating that LoRA effectively adapts speaker identity representations without degrading linguistic modeling. Third, signal-level quality improves in most cases, with signal-to-noise ratio increasing by as much as 34 percent. Crucially, these improvements are strongly governed by the characteristics of the training data. Speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR. Overall, this work establishes that LoRA fine-tuning is not merely a parameter-efficient optimization technique but an effective mechanism for better speaker-level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data, LoRA-adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality and speaker similarity, with low latency when using a GGUF model hosted in quantized form.
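Abstract (reference translation): Large language models are increasingly adopted as semantic backbones for neural text-to-speech systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Experiments that fine-tune the language model backbone of TTS show promise for improving voice consistency and signal-to-noise ratio (SNR) in the voice cloning task. Across multiple speakers, LoRA fine-tuning consistently outperforms the non-fine-tuned base Qwen-0.5B model across three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to 0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity, indicating that LoRA effectively adapts speaker identity representations without degrading linguistic modeling. Third, signal-level quality improves in most cases, with signal-to-noise ratio increasing by up to 34 percent. These improvements are strongly governed by the characteristics of the training data. Speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR. Overall, this work establishes that LoRA fine-tuning is not merely a parameter-efficient optimization technique but an effective mechanism for improving speaker-level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data, LoRA-adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality and speaker similarity, with low latency when using a GGUF model hosted in quantized form.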
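
Illustrative note: the abstract describes LoRA as adapting a frozen backbone with a small number of trainable parameters. The sketch below shows the standard LoRA formulation, in which a frozen weight W is augmented by a scaled low-rank product B·A. It is a minimal NumPy illustration only, not the authors' training code; the layer sizes, rank, and alpha value are assumptions chosen for the example and do not come from the paper.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Frozen linear layer plus a low-rank LoRA update.

    x: (batch, d_in), W: (d_out, d_in) frozen,
    A: (r, d_in) trainable, B: (d_out, r) trainable.
    """
    r = A.shape[0]
    scale = alpha / r  # standard LoRA scaling factor
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16  # hypothetical sizes, not from the paper

W = rng.normal(size=(d_out, d_in))          # frozen base weight
A = rng.normal(scale=0.01, size=(r, d_in))  # small random init
B = np.zeros((d_out, r))                    # zero init: the update starts at zero

x = rng.normal(size=(4, d_in))
y_base = x @ W.T
y_lora = lora_forward(x, W, A, B, alpha)

# Only A and B would be trained; W stays fixed.
trainable = A.size + B.size
full = W.size
print(f"trainable LoRA parameters: {trainable} of {full} in the full matrix")
```

Because B starts at zero, the adapted layer initially reproduces the frozen base layer exactly, and fine-tuning then moves it away from the base only through the low-rank factors.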

Related papers