Fugu-MT Paper Translation (Abstract): VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

Paper overview: VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

arxiv url: http://arxiv.org/abs/2605.02834v2
Date: Tue, 05 May 2026 09:59:53 GMT
Status: Translation complete
Last updated in system: 2026-05-06 14:45:21.347414
Title: VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
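Title (reference translation): VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Authors: Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna
Abstract summary: We introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. Vision-language models (VLMs) struggle to fully exploit in-context examples. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs.
Reference score (independently computed attention metric): 107.13283099863123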
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves $+7.0\%$, while Gemini declines $-4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.
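Abstract (reference translation): Videos are unique in their ability to capture actions that span multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, for lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate a renewed focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the gap between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy, while Qwen3-VL-8B reaches only 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Even so, Qwen achieves only 59.2% accuracy. Relaxing the evaluation setup further, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter: Qwen improves by $+7.0\%$, while Gemini declines by $-4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement seen in non-expert humans given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.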

Related papers