Fugu-MT Paper Translation (Abstract): SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

Paper overview: SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

arxiv url: http://arxiv.org/abs/2511.06499v2
Date: Mon, 17 Nov 2025 03:11:19 GMT
Status: Translation complete
Last updated in system: 2025-11-18 14:36:22.004038
Title: SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports
Authors: Haotian Xia, Haonan Ge, Junbo Zou, Hyun Woo Choi, Xuebin Zhang, Danny Suradja, Botao Rui, Ethan Tran, Wendy Jin, Zhen Ye, Xiyang Lin, Christopher Lai, Shengjie Zhang, Junwen Miao, Shichao Chen, Rhys Tracy, Vicente Ordonez, Weining Shen, Hanjie Chen
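Abstract summary: SportR is the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. The benchmark provides a dataset of 5,017 images and 2,101 videos. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, it provides 7,118 high-quality, human-authored Chain of Thought (CoT) annotations.
Reference score (independently computed attention metric): 21.410115837645318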
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning - a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question-answer (QA) pairs designed to probe reasoning at increasing depths - from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought (CoT) annotations. In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to test visual grounding in the image part directly. Extensive experiments demonstrate the profound difficulty of our benchmark. State-of-the-art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.

Related papers