Fugu-MT Paper Translation (Abstract): MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance

Paper overview: MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance

arxiv url: http://arxiv.org/abs/2601.17866v1
Date: Sun, 25 Jan 2026 15:00:37 GMT
Status: Translation complete
Last updated in system: 2026-01-27 15:23:08.495779
Title: MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance
Authors: Yoonwoo Jeong, Cheng Sun, Yu-Chiang Frank Wang, Minsu Cho, Jaesung Choe
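Abstract summary: This paper introduces MV-SAM, a multi-view segmentation framework that achieves 3D consistency using pointmaps. MV-SAM lifts images and prompts into 3D space, so it needs neither explicit 3D networks nor annotated 3D data.
Reference score (attention score computed independently by this site): 79.57732829495843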
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Promptable segmentation has emerged as a powerful paradigm in computer vision, enabling users to guide models in parsing complex scenes with prompts such as clicks, boxes, or textual cues. Recent advances, exemplified by the Segment Anything Model (SAM), have extended this paradigm to videos and multi-view images. However, the lack of 3D awareness often leads to inconsistent results, necessitating costly per-scene optimization to enforce 3D consistency. In this work, we introduce MV-SAM, a framework for multi-view segmentation that achieves 3D consistency using pointmaps -- 3D points reconstructed from unposed images by recent visual geometry models. Leveraging the pixel-point one-to-one correspondence of pointmaps, MV-SAM lifts images and prompts into 3D space, eliminating the need for explicit 3D networks or annotated 3D data. Specifically, MV-SAM extends SAM by lifting image embeddings from its pretrained encoder into 3D point embeddings, which are decoded by a transformer using cross-attention with 3D prompt embeddings. This design aligns 2D interactions with 3D geometry, enabling the model to implicitly learn consistent masks across views through 3D positional embeddings. Trained on the SA-1B dataset, our method generalizes well across domains, outperforming SAM2-Video and achieving comparable performance with per-scene optimization baselines on NVOS, SPIn-NeRF, ScanNet++, uCo3D, and DL3DV benchmarks. Code will be released.
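The lifting and decoding steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (the code has not been released). It assumes per-pixel embeddings at the same resolution as the pointmap, a random Fourier encoding of 3D coordinates as the 3D positional embedding, and a single-head cross-attention step with no learned weights. All function names (`fourier_encode`, `lift_to_points`, `cross_attend`, `segment`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def fourier_encode(xyz, freqs):
    """Map 3D coordinates (..., 3) to (..., 2*F) sin/cos features.
    Assumption: a random Fourier encoding stands in for the paper's 3D positional embedding."""
    proj = 2.0 * np.pi * xyz @ freqs
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

def lift_to_points(image_emb, pointmap, freqs):
    """Add a 3D positional code to every per-pixel embedding.
    image_emb: (V, H, W, C); pointmap: (V, H, W, 3). Returns (V*H*W, C)."""
    V, H, W, C = image_emb.shape
    tokens = image_emb + fourier_encode(pointmap, freqs)
    return tokens.reshape(V * H * W, C)

def cross_attend(query, keys):
    """Single-head scaled dot-product cross-attention; query (Q, C), keys (N, C)."""
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys

def segment(image_emb, pointmap, click, freqs):
    """Lift a 2D click (view, row, col) to 3D and score every point against it."""
    V, H, W, C = image_emb.shape
    v, y, x = click
    points = lift_to_points(image_emb, pointmap, freqs)
    # The pixel-point one-to-one correspondence gives the click its 3D position.
    prompt = fourier_encode(pointmap[v, y, x], freqs)[None, :]
    prompt = prompt + cross_attend(prompt, points)
    logits = points @ prompt[0]  # one mask logit per 3D point
    return logits.reshape(V, H, W)

# Toy inputs: 2 views, 4x4 grid, 16-dim embeddings (random placeholders).
rng = np.random.default_rng(0)
V, H, W, C = 2, 4, 4, 16
freqs = rng.normal(size=(3, C // 2))
image_emb = rng.normal(size=(V, H, W, C))
pointmap = rng.uniform(-1, 1, size=(V, H, W, 3))
logits = segment(image_emb, pointmap, click=(0, 1, 2), freqs=freqs)
print(logits.shape)  # (2, 4, 4)
```

Because the positional code depends only on a point's 3D coordinates, pixels in different views that map to the same 3D location receive the same code, which is the intuition behind obtaining consistent masks across views without a dedicated 3D network.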

