Fugu-MT paper translation (abstract): English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM

Paper overview: English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM

arxiv url: http://arxiv.org/abs/2509.02915v1
Date: Wed, 03 Sep 2025 00:56:18 GMT
Status: translation complete
Last updated in system: 2025-09-04 21:40:46.37518
Title: English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM
Title (reference translation): English pronunciation evaluation without complex joint training: a LoRA fine-tuned speech multimodal LLM
Authors: Taekyung Ahn, Hosung Nam
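Abstract summary: This study shows that a Multimodal Large Language Model (MLLM) adapted with Low-Rank Adaptation (LoRA) can perform Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. The fine-tuning method removes the need for complex architectural changes or separate training procedures for these distinct tasks. The study shows that an integrated pronunciation assessment system can be built by adapting a large multimodal model without full fine-tuning.
Reference score (attention measure computed independently by the site): 0.0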
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This study demonstrates that a Multimodal Large Language Model (MLLM) adapted via Low-Rank Adaptation (LoRA) can perform both Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. Leveraging Microsoft's Phi-4-multimodal-instruct, our fine-tuning method eliminates the need for complex architectural changes or separate training procedures conventionally required for these distinct tasks. Fine-tuned on the Speechocean762 dataset, the pronunciation evaluation scores predicted by the model exhibited a strong Pearson Correlation Coefficient (PCC > 0.7) with human-assigned scores, while achieving low Word Error Rate (WER) and Phoneme Error Rate (PER) (both < 0.15). Notably, fine-tuning only the LoRA layers was sufficient to achieve performance levels comparable to those achieved by fine-tuning all audio layers. This research highlights that an integrated pronunciation assessment system can be established by adapting large multimodal models without full fine-tuning, utilizing a significantly simpler training methodology compared to previous joint models designed for simultaneous APA and MDD. This efficient LoRA-based approach paves the way for more accessible, integrated, and effective Computer-Assisted Pronunciation Training (CAPT) technologies for English L2 learners.
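Abstract (reference translation): This study shows that a Multimodal Large Language Model (MLLM) adapted with Low-Rank Adaptation (LoRA) can perform Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. By building on Microsoft's Phi-4-multimodal-instruct, the fine-tuning method removes the need for the complex architectural changes or separate training procedures conventionally required for these distinct tasks. After fine-tuning on the Speechocean762 dataset, the pronunciation scores predicted by the model showed a strong Pearson Correlation Coefficient (PCC > 0.7) with human-assigned scores, together with a low Word Error Rate (WER) and Phoneme Error Rate (PER) (both < 0.15). Notably, fine-tuning only the LoRA layers was sufficient to reach performance comparable to fine-tuning all audio layers. The study shows that an integrated pronunciation assessment system can be built by adapting a large multimodal model without full fine-tuning, using a significantly simpler training methodology than previous joint models designed for simultaneous APA and MDD. This efficient LoRA-based approach paves the way for more accessible, integrated, and effective Computer-Assisted Pronunciation Training (CAPT) technologies for English L2 learners.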
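
The abstract reports its results with three metrics: PCC between predicted and human-assigned scores, and WER and PER for recognized words and phonemes. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. The function names (`pearson`, `edit_distance`, `error_rate`) and all sample values are hypothetical, and it assumes WER and PER are defined in the usual way as token-level edit distance divided by reference length.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference token
                curr[j - 1] + 1,         # insert a hypothesis token
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by reference length (WER or PER by token type)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical sample data, for illustration only.
human_scores = [8.0, 6.0, 9.0, 4.0]
model_scores = [7.5, 6.5, 8.0, 5.0]
pcc = pearson(human_scores, model_scores)

# WER: tokens are words. PER: tokens are phonemes.
wer = error_rate("the cat sat down".split(), "the bat sat".split())
per = error_rate(["DH", "AH", "K", "AE", "T"], ["D", "AH", "K", "AE", "T"])
```

The same `error_rate` function serves both metrics because only the tokenization differs. In practice an established library would normally be preferred over a hand-written implementation.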

Related papers