Fugu-MT Paper Translation (Abstract): RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

Paper overview: RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

arxiv url: http://arxiv.org/abs/2605.07334v1
Date: Fri, 08 May 2026 06:39:53 GMT
Status: translation complete
System update date: 2026-05-11 19:43:38.866977
Title: RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
Title (reference translation): RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
Authors: Junwei Wen, Deshui Miao, Guangming Lu, Xin Li, Wenjie Pei,
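Abstract summary: Video Reasoning Segmentation (VRS) aims to segment target objects in videos based on implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods select frames via simple sampling or an auxiliary MLLM and then predict masks with a [SEG] token. RCoT-Seg is a video-of-thought framework that factorizes VRS into temporal video reasoning (TVR) and keyframe target perception (KTP).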
Reference score (attention level, computed independently by the site): 48.30592530624143
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Reasoning Segmentation (VRS) aims to segment target objects in videos based on implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods predict masks with a [SEG] token after selecting frames via simple sampling or an auxiliary MLLM, where limited supervision and frame-language similarity rules often yield narrow-scope keyframe choices that weaken holistic temporal understanding and lead to brittle localization in complex multi-object scenes. To address these issues, we introduce RCoT-Seg, a video-of-thought framework that factorizes VRS into temporal video reasoning (TVR) and keyframe target perception (KTP), explicitly separating temporal reasoning from spatial perception. Specifically, in the TVR stage, an agentic keyframe selection module, initialized with a curated CoT-start corpus and refined by GRPO under task-aligned rewards, is proposed to generate and reselect the keyframe through self-evaluation, strengthening moment localization and temporal reasoning. In the KTP stage, RCoT-Seg performs high-resolution segmentation on the selected frame and propagates masks with SAM2-based methods across the sequence, replacing heuristic sampling and external selectors while improving spatial precision and inter-frame consistency. Extensive experimental results demonstrate that the proposed RCoT-Seg achieves favorable performance against the state-of-the-art methods. The code and models will be publicly released at https://github.com/Victor-wjw/RCoT-Seg.
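Abstract (reference translation): Video Reasoning Segmentation (VRS) aims to segment target objects in videos based on implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods select frames via simple sampling or an auxiliary MLLM and then predict masks with a [SEG] token; limited supervision and frame-language similarity rules often yield narrow keyframe choices, which weakens holistic temporal understanding and makes localization brittle in complex multi-object scenes. To address these issues, we introduce RCoT-Seg, a video-of-thought framework that factorizes VRS into temporal video reasoning (TVR) and keyframe target perception (KTP), explicitly separating temporal reasoning from spatial perception. In the TVR stage, an agentic keyframe selection module, initialized with a curated CoT-start corpus and refined by GRPO under task-aligned rewards, generates and reselects the keyframe through self-evaluation, strengthening moment localization and temporal reasoning. In the KTP stage, RCoT-Seg performs high-resolution segmentation on the selected frame and propagates the masks across the sequence with SAM2-based methods, replacing heuristic sampling and external selectors while improving spatial precision and inter-frame consistency. Experiments show that RCoT-Seg performs favorably against state-of-the-art methods. The code and models will be publicly released at https://github.com/Victor-wjw/RCoT-Seg.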
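
The abstract says the keyframe selector is refined by GRPO under task-aligned rewards but does not give the reward design. As a rough illustration only, the sketch below shows the group-relative advantage step that GRPO is generally built on: several candidate outputs are sampled for the same query, and each reward is normalized against the mean and standard deviation of its own group. The function name, the reward values, and the use of the population standard deviation are assumptions made for this example. None of them come from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar per candidate sampled for the same query.
    `eps` keeps the division finite when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four keyframe proposals sampled for one query.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

# Index of the proposal with the largest advantage in this group.
best = max(range(len(rewards)), key=lambda i: advantages[i])
```

Proposals that score above their group average receive a positive advantage and are reinforced, while those below it are discouraged. Because the baseline is the group mean, no separate value network is needed.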

Related papers