Fugu-MT Paper Translation (Abstract): Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities

Paper overview: Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities

arxiv url: http://arxiv.org/abs/2510.19892v1
Date: Wed, 22 Oct 2025 17:21:16 GMT
Status: Translation complete
System update date: 2025-10-25 03:08:16.527029
Title: Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities
Title (reference translation): Can They Play Dixit? Yes, They Can! Dixit as a Playground for Multimodal Language Model Capabilities
Authors: Nishant Balepur, Dang Nguyen, Dayeon Ki,
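Abstract summary: This paper proposes game-based evaluation for assessing model capabilities. Games require multiple abilities for players to win, are inherently competitive, and are governed by fixed, objective rules. We demonstrate this evaluation concretely through Dixit, a fantasy card game.
Reference score (attention level, computed independently by this site): 17.019600215402704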
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-modal large language models (MLMs) are often assessed on static, individual benchmarks, which cannot jointly assess MLM capabilities in a single task, or on human or model pairwise comparisons, which are highly subjective, expensive, and allow models to exploit superficial shortcuts (e.g., verbosity) to inflate their win-rates. To overcome these issues, we propose game-based evaluations to holistically assess MLM capabilities. Games require multiple abilities for players to win, are inherently competitive, are governed by fixed, objective rules, and make evaluation more engaging, providing a robust framework to address the aforementioned challenges. We manifest this evaluation specifically through Dixit, a fantasy card game where players must generate captions for a card that trick some, but not all, players into selecting the played card. Our quantitative experiments with five MLMs show Dixit win-rate rankings are perfectly correlated with those on popular MLM benchmarks, while games between human and MLM players in Dixit reveal several differences between agent strategies and areas of improvement for MLM reasoning.
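Abstract (reference translation): Multimodal large language models (MLMs) are often evaluated either on static, individual benchmarks, which cannot jointly evaluate MLM capabilities in a single task, or through pairwise comparisons by humans or models, which are highly subjective and expensive and which let models exploit superficial shortcuts (for example, verbosity) to raise their win rates. To overcome these problems, we propose a game-based evaluation method. Games require multiple abilities for players to win, are inherently competitive, are governed by fixed, objective rules, and make evaluation more engaging, providing a robust framework for addressing the challenges above. We realize this evaluation through Dixit, a fantasy card game in which a player must generate a caption for a card that tricks some, but not all, of the other players into selecting the played card. Quantitative experiments with five MLMs show that Dixit win-rate rankings are perfectly correlated with rankings on popular MLM benchmarks, and games between human and MLM players in Dixit reveal several differences between agent strategies as well as areas of improvement for MLM reasoning.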
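The abstract's rule that a caption should trick "some, but not all" players can be illustrated with a small scoring sketch. This is a minimal example assuming the standard tabletop Dixit scoring rules; the paper may use a different scheme. The function name `score_round` and its parameters are hypothetical and do not come from the paper.

```python
from typing import Dict


def score_round(
    storyteller: str,
    votes: Dict[str, str],
    owner_of: Dict[str, str],
) -> Dict[str, int]:
    """Score one Dixit round under the standard tabletop rules (an assumption).

    votes:    guesser -> id of the card they voted for (the storyteller does not vote)
    owner_of: card id -> player who played that card
    """
    players = set(owner_of.values())
    scores = {p: 0 for p in players}

    # Guessers who picked the storyteller's card.
    correct = [g for g, card in votes.items() if owner_of[card] == storyteller]

    if len(correct) == 0 or len(correct) == len(votes):
        # Caption was too obscure or too obvious: the storyteller scores
        # nothing and every other player scores 2.
        for p in players - {storyteller}:
            scores[p] += 2
    else:
        # Caption fooled some but not all: the storyteller and each
        # correct guesser score 3.
        scores[storyteller] += 3
        for g in correct:
            scores[g] += 3

    # Decoy bonus: 1 point to a non-storyteller for each vote on their card.
    for card in votes.values():
        owner = owner_of[card]
        if owner != storyteller:
            scores[owner] += 1

    return scores


# Example round: A is the storyteller; B and D find A's card, C picks B's decoy.
owner_of = {"a": "A", "b": "B", "c": "C", "d": "D"}
result = score_round("A", {"B": "a", "C": "b", "D": "a"}, owner_of)
```

Under these assumed rules the storyteller is rewarded only when the caption is neither too obvious nor too obscure, which is the incentive the abstract describes.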


Related papers