Fugu-MT Paper Translation (Abstract): Lost in Translation: Do LVLM Judges Generalize Across Languages?

Paper overview: Lost in Translation: Do LVLM Judges Generalize Across Languages?

arxiv url: http://arxiv.org/abs/2604.19405v1
Date: Tue, 21 Apr 2026 12:29:10 GMT
Status: Translation complete
System update date: 2026-04-22 22:41:49.76616
Title: Lost in Translation: Do LVLM Judges Generalize Across Languages?
Authors: Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Mir Tafseer Nayeem, Amran Bhuiyan, Mizanur Rahman, Shafiq Joty, Enamul Hoque, Jimmy Huang
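Abstract summary: MM-JudgeBench is the first large-scale benchmark for evaluating multilingual and multimodal judge models. It contains over 60K pairwise preference instances spanning 25 typologically diverse languages. Evaluating 22 LVLMs on the proposed benchmark reveals substantial cross-lingual performance variance.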
Reference score (independently computed attention score): 46.119587015038746
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.
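The abstract describes scoring judges on pairwise preference instances and comparing performance across languages. A minimal sketch of that kind of analysis follows, assuming each record is a hypothetical (language, judge choice, human-preferred choice) tuple; the record layout, the "A"/"B" labels, and the toy data are illustrative assumptions, not the benchmark's actual format or the authors' evaluation code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_language_accuracy(records):
    """Fraction of pairwise instances where the judge picks the preferred response.

    `records` is an iterable of (language, judge_choice, preferred) tuples.
    This layout is an assumption made for illustration.
    """
    hits = defaultdict(list)
    for language, judge_choice, preferred in records:
        hits[language].append(1.0 if judge_choice == preferred else 0.0)
    return {language: mean(flags) for language, flags in hits.items()}


def cross_lingual_spread(accuracy):
    """Summarize how much accuracy varies across languages."""
    values = list(accuracy.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),              # population standard deviation
        "gap": max(values) - min(values),   # best language minus worst language
    }


# Toy data (hypothetical): four instances each for two languages.
records = [
    ("en", "A", "A"), ("en", "B", "B"), ("en", "A", "A"), ("en", "B", "A"),
    ("bn", "A", "B"), ("bn", "B", "B"), ("bn", "A", "B"), ("bn", "B", "A"),
]

accuracy = per_language_accuracy(records)
spread = cross_lingual_spread(accuracy)
print(accuracy)
print(spread)
```

A large `gap` or `std` relative to the mean would indicate the kind of inconsistent cross-lingual behavior the abstract reports; other spread measures could be substituted.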
