Fugu-MT Paper Translation (Abstract): X2SAM: Any Segmentation in Images and Videos

Paper overview: X2SAM: Any Segmentation in Images and Videos

arxiv url: http://arxiv.org/abs/2605.00891v1
Date: Mon, 27 Apr 2026 16:24:45 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.454206
Title: X2SAM: Any Segmentation in Images and Videos
Authors: Hao Wang, Limeng Qiao, Chi Zhang, Lin Ma, Guanglu Wan, Xiangyuan Lan, Xiaodan Liang
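Abstract summary: The paper introduces X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. It also introduces the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.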
Reference score (attention level, computed independently by this site): 62.84804286933252
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.

