Fugu-MT Paper Translation (Abstract): Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning

Paper overview: Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning

arxiv url: http://arxiv.org/abs/2603.20887v1
Date: Sat, 21 Mar 2026 17:26:01 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.135107
Title: Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
Authors: Xu Zhang, Jin Yuan, BinHong Yang, Xuan Liu, Qianjun Zhang, Yuyi Wang, Zhiyong Li, Hanwang Zhang
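Abstract summary: SegCaptioning is a new task that generates correlated masks and captions based on user intent. The proposed framework integrates a Prompt-guided Temporal Graph Former with a Fine-grained Mask-linguistic Decoder. The proposed model effectively captures user intent and generates precise multimodal outputs tailored to user specifications.
Reference score (independently computed attention score): 47.59654620860484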
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in multimodal large models have significantly bridged the representation gap between diverse modalities, catalyzing the evolution of video multimodal interpretation, which enhances users' understanding of video content by generating correlated modalities. However, most existing video multimodal interpretation methods concentrate on global comprehension with limited user interaction. To address this, we propose a novel task, Controllable Video Segmentation and Captioning (SegCaptioning), which lets users provide specific prompts, such as a bounding box around an object of interest, to simultaneously generate correlated masks and captions that precisely embody user intent. We design a framework, the Scene Graph-guided Fine-grained SegCaptioning Transformer (SG-FSCFormer), which integrates a Prompt-guided Temporal Graph Former to effectively capture and represent user intent through an adaptive prompt adaptor, ensuring that the generated content aligns well with the user's requirements. Furthermore, our model introduces a Fine-grained Mask-linguistic Decoder that collaboratively predicts high-quality caption-mask pairs using a Multi-entity Contrastive loss and provides fine-grained alignment between each mask and its corresponding caption tokens, thereby enhancing users' comprehension of videos. Comprehensive experiments conducted on two benchmark datasets demonstrate that SG-FSCFormer achieves remarkable performance, effectively capturing user intent and generating precise multimodal outputs tailored to user specifications. Our code is available at https://github.com/XuZhang1211/SG-FSCFormer.


Related papers