Fugu-MT Paper Translation (Abstract): Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Paper overview: Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

arxiv url: http://arxiv.org/abs/2510.08480v1
Date: Thu, 09 Oct 2025 17:20:44 GMT
Status: translation complete
Last updated in system: 2025-10-10 17:54:15.241041
Title: Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
Title (reference translation): Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
Authors: Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li
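Abstract summary: Video-STAR is a framework that harmonizes tool-augmented reinforcement learning with contextual sub-motion decomposition for open-vocabulary action recognition. Unlike prior methods that treat actions as monolithic entities, the approach decomposes actions into discriminative sub-motions for fine-grained matching. The method autonomously leverages external tools, without explicit supervision, to prioritize sub-motion patterns, shifting from text-centric reasoning to visually grounded inference.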
Reference score (independently computed attention score): 41.993750134878766
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.
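Abstract (reference translation): Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning. This paper proposes Video-STAR, a framework that harmonizes tool-augmented reinforcement learning with contextual sub-motion decomposition for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, the method decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, which enables category-specific reasoning and reduces cross-modal hallucination. Furthermore, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, the method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, shifting from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, and validating the method's robustness and generalization.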
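Illustrative note: the abstract names three reward components (tool-usage efficiency, sub-motion relevance, and structural coherence) but gives no formula. The Python sketch below is only one possible reading of a "hierarchical" reward, in which structural coherence acts as a gate before the other two terms are added. Every name, weight, and threshold here (`RolloutScores`, `hierarchical_reward`, `base`, `w_relevance`, `w_tool`, `max_tool_calls`) is a hypothetical assumption for illustration and is not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RolloutScores:
    """Hypothetical per-response scores; not the paper's actual interface."""
    structure_ok: bool          # reasoning follows the expected output structure
    submotion_relevance: float  # assumed to lie in [0, 1]
    tool_calls: int             # number of external tool invocations


def hierarchical_reward(
    s: RolloutScores,
    max_tool_calls: int = 3,   # assumed budget of tool calls
    base: float = 0.3,         # assumed reward for well-structured reasoning
    w_relevance: float = 0.5,  # assumed weight on sub-motion relevance
    w_tool: float = 0.2,       # assumed weight on tool-usage efficiency
) -> float:
    # Level 1: structural coherence gates everything else.
    if not s.structure_ok:
        return 0.0
    reward = base

    # Level 2: sub-motion relevance, clamped to [0, 1].
    relevance = min(1.0, max(0.0, s.submotion_relevance))
    reward += w_relevance * relevance

    # Level 3: tool-usage efficiency. Tools must be used at least once,
    # and each extra call within the budget lowers the bonus.
    if 1 <= s.tool_calls <= max_tool_calls:
        efficiency = 1.0 - (s.tool_calls - 1) / max_tool_calls
        reward += w_tool * efficiency

    return reward


if __name__ == "__main__":
    good = RolloutScores(structure_ok=True, submotion_relevance=0.8, tool_calls=1)
    malformed = RolloutScores(structure_ok=False, submotion_relevance=0.8, tool_calls=1)
    print(hierarchical_reward(good))
    print(hierarchical_reward(malformed))
```

Under these assumed weights, a malformed response earns nothing regardless of its other scores, which is the sense in which the sketch is hierarchical; a plain weighted sum of the three terms would be an equally plausible reading of the abstract.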


Related papers