Fugu-MT Paper Translation (Abstract): Benchmarking Large Language Models for Grapheme-to-Phoneme Conversion: A Japanese Case Study

Paper overview: Benchmarking Large Language Models for Grapheme-to-Phoneme Conversion: A Japanese Case Study

arxiv url: http://arxiv.org/abs/2606.22009v1
Date: Sat, 20 Jun 2026 12:17:03 GMT
Status: Retrieving information
Last updated in system: 2026-06-23 15:21:41.279928
Title: Benchmarking Large Language Models for Grapheme-to-Phoneme Conversion: A Japanese Case Study
Title (reference translation): Benchmarking Large Language Models for Grapheme-to-Phoneme Conversion
Authors: Tomoki Koriyama
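Abstract summary: Grapheme-to-phoneme (G2P) conversion is essential for controllable and robust text-to-speech (TTS). The paper benchmarks over 30 large language models (LLMs) on Japanese G2P. Parse mode outperforms direct mode for most models. Feeding LLM-predicted kana into a kana-input TTS yields better pronunciation than end-to-end TTS.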
Reference score (independently computed attention score): 10.32543637637479
License:
Abstract: Grapheme-to-phoneme (G2P) conversion is essential for controllable and robust text-to-speech, and large language models (LLMs), with broad linguistic knowledge, offer a promising approach. We benchmarked over 30 LLMs on Japanese G2P, comparing them with conventional morphological analyzers on 3000 manually annotated sentences. We evaluated two prompting strategies: a parse mode, where the LLM performs morphological analysis followed by rule-based kana conversion, and a direct mode, where the LLM directly predicts kana readings. The results show that model size, version, and Japanese-specialized training are key factors, with the best LLMs achieving kana character error rate below 0.52% vs. the best conventional tool (1.03%). Parse mode outperforms direct mode for most models, as rule-based post-processing relieves the LLM of handling complex pronunciation rules. We also show that feeding LLM-predicted kana into a kana-input TTS yields better pronunciation than end-to-end TTS.
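Abstract (reference translation): Grapheme-to-phoneme (G2P) conversion is essential for controllable and robust text-to-speech, and large language models (LLMs), with their broad linguistic knowledge, offer a promising approach. We benchmarked more than 30 LLMs on Japanese G2P and compared them with conventional morphological analyzers on 3000 annotated sentences. We evaluated two prompting strategies: a parse mode, in which the LLM performs morphological analysis and rule-based kana conversion follows, and a direct mode, in which the LLM predicts kana readings directly. The results show that model size, version, and Japanese-specialized training are key factors, with the best LLMs reaching a kana character error rate below 0.52%, compared with 1.03% for the best conventional tool. Parse mode outperforms direct mode for most models because rule-based post-processing relieves the LLM of handling complex pronunciation rules. We also show that feeding LLM-predicted kana into a kana-input TTS yields better pronunciation than end-to-end TTS.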
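
The abstract reports results as a kana character error rate (CER). As a rough illustration of how such a metric is commonly computed, the sketch below divides the total Levenshtein edit distance by the total number of reference characters. It is an assumption-laden sketch, not the paper's evaluation code: the function names `levenshtein` and `kana_cer`, the corpus-level aggregation, the absence of any text normalization, and the sample strings are all hypothetical, since the abstract does not specify these details.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `hyp` into `ref` (character level)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def kana_cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total edits / total reference characters.
    Assumption: strings are compared as given, with no normalization."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars


if __name__ == "__main__":
    # Made-up example pairs, not data from the paper.
    refs = ["コンニチワ", "トウキョウ"]
    hyps = ["コンニチハ", "トウキョウ"]
    print(f"CER = {kana_cer(refs, hyps):.2%}")
```

Under this definition, one substituted kana in ten reference characters gives a CER of 10%. A per-sentence average would give different numbers on corpora with uneven sentence lengths, so the aggregation choice should be confirmed against the paper before comparing figures.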

Related papers