Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
- URL: http://arxiv.org/abs/2510.08480v1
- Date: Thu, 09 Oct 2025 17:20:44 GMT
- Title: Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
- Authors: Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li
- Abstract summary: Video-STAR is a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition. Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching. Our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference.
- Score: 41.993750134878766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 demonstrate state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, and validating the robustness and generalization of our approach.
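The abstract describes a hierarchical reward balancing tool-usage efficiency, sub-motion relevance, and structural coherence, but does not give its exact form. The Python sketch below is a minimal illustration assuming the three terms are normalized to [0, 1] and combined as a weighted sum; all names, weights, and scoring heuristics (RolloutTrace, hierarchical_reward, w_tool, etc.) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of a hierarchical reward combining the three terms
# named in the abstract; weights and scoring heuristics are illustrative.
from dataclasses import dataclass

@dataclass
class RolloutTrace:
    tool_calls: int           # external tool invocations in the reasoning trace
    useful_tool_calls: int    # calls whose output is referenced in the final answer
    matched_submotions: int   # sub-motions of the predicted class matched to video evidence
    total_submotions: int     # sub-motions the class label was decomposed into
    well_formed_steps: int    # reasoning steps following the expected structure
    total_steps: int          # all reasoning steps in the trace

def hierarchical_reward(trace: RolloutTrace,
                        w_tool: float = 0.3,
                        w_submotion: float = 0.5,
                        w_structure: float = 0.2) -> float:
    """Combine tool-usage efficiency, sub-motion relevance, and structural
    coherence into a single scalar reward for the RL update."""
    tool_efficiency = trace.useful_tool_calls / max(trace.tool_calls, 1)
    submotion_relevance = trace.matched_submotions / max(trace.total_submotions, 1)
    structural_coherence = trace.well_formed_steps / max(trace.total_steps, 1)
    return (w_tool * tool_efficiency
            + w_submotion * submotion_relevance
            + w_structure * structural_coherence)

# Example: 3 tool calls (2 useful), 4 of 5 sub-motions matched,
# 6 of 7 well-formed reasoning steps -> reward of about 0.77.
print(hierarchical_reward(RolloutTrace(3, 2, 4, 5, 6, 7)))
```

A weighted sum is only one plausible choice; the paper's reward could equally be staged or gated (e.g., sub-motion relevance scored only when the reasoning structure is valid).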
Related papers
- FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models [20.47311573790516]
We propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model's original visual capabilities.
arXiv Detail & Related papers (2026-01-29T02:36:19Z)
- Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z)
- Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
- Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation [48.806000388608005]
We propose X-Agent, an innovative OVSS framework employing a "latent semantic-aware agent" to orchestrate cross-modal attention mechanisms. X-Agent achieves state-of-the-art performance while effectively enhancing the latent semantic saliency.
arXiv Detail & Related papers (2025-09-01T09:01:58Z)
- Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches. We introduce the multi-granularity implicit text (MIT), involving video-, segment-, and frame-level text, as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z)
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)