Fugu-MT paper translation (abstract): Disentangling Language Roles in Multilingual LLM Task Execution

Paper overview: Disentangling Language Roles in Multilingual LLM Task Execution

arxiv url: http://arxiv.org/abs/2605.27649v1
Date: Tue, 26 May 2026 20:09:34 GMT
Status: translation complete
Last updated in system: 2026-05-28 17:38:55.500662
Title: Disentangling Language Roles in Multilingual LLM Task Execution
Authors: Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He
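Abstract summary: MTM-Bench is a benchmark for language-conditioned task execution. It enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final-state extraction, and language purity. The authors evaluate 20 frontier and open-weight LLMs using metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success.
Reference score (independently computed attention score): 17.182371695349385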
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet \((L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})\). Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. The response-only and full-mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.
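The fully crossed design described in the abstract can be illustrated with a minimal sketch that enumerates every \((L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})\) triplet over three languages and groups the triplets by which role is mismatched. The language codes, the `classify` helper, and the pattern names are assumptions made for illustration; the paper may define its mismatch conditions differently. The per-triplet count at the end assumes instances are spread evenly across triplets, which the abstract does not state.

```python
from collections import Counter
from itertools import product

# Illustrative labels for English, Spanish, and Chinese (assumed codes).
LANGUAGES = ("en", "es", "zh")


def classify(instr: str, content: str, resp: str) -> str:
    """Name the mismatch pattern of one triplet (hypothetical labels)."""
    distinct = len({instr, content, resp})
    if distinct == 1:
        return "full_match"
    if distinct == 3:
        return "full_mismatch"
    # Exactly two roles share a language; report the role that differs.
    if instr == content:
        return "response_only"
    if instr == resp:
        return "content_only"
    return "instruction_only"


# Fully crossed design: every combination of the three roles.
triplets = list(product(LANGUAGES, repeat=3))
pattern_counts = Counter(classify(*t) for t in triplets)

print("triplets:", len(triplets))
for pattern, count in sorted(pattern_counts.items()):
    print(pattern, count)

# Assumes an even split of the 2,430 instances across triplets.
print("instances per triplet:", 2430 // len(triplets))
```

Under these assumed labels, the response-only and full-mismatch conditions each cover six triplets, so a comparison between them is balanced in triplet count even though one has a single mismatched slot and the other has none in common.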

Related papers