Fugu-MT Paper Translation (Summary): 3D Segmentation Using Viewpoint-Dependent Spatial Relationships

Paper summary: 3D Segmentation Using Viewpoint-Dependent Spatial Relationships

arxiv url: http://arxiv.org/abs/2605.15708v1
Date: Fri, 15 May 2026 07:58:44 GMT
Status: Translation complete
Last updated in system: 2026-05-19 03:45:13.165338
Title: 3D Segmentation Using Viewpoint-Dependent Spatial Relationships
Authors: Ayaka Nanri, Klara Reichard, Mert Kiray, Federico Tombari, Benjamin Busam, Asako Kanezaki
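Abstract summary: The authors propose a viewpoint-aware 3D referring segmentation dataset containing 220k benchmark samples. In this dataset, target objects can be identified only through observer-centric spatial relations. They introduce a viewpoint representation that encodes camera poses and condition the model on the observation viewpoint.
Reference score (attention score computed independently by this site): 55.198821645924234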
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in 3D datasets and multimodal models have greatly improved natural language 3D scene understanding. However, most 3D referring segmentation methods do not explicitly represent the observer viewpoint, making spatial relations such as "left," "right," "front," and "behind" ambiguous and difficult to evaluate. We introduce a viewpoint-aware 3D referring segmentation dataset containing 220k benchmark samples, and scalable to tens of millions of viewpoint-conditioned samples through dense viewpoint sampling. In this dataset, target objects can only be identified through observer-centric spatial relations, making viewpoint-conditioned grounding necessary. We construct the benchmark by leveraging camera poses to automatically annotate observer-centric relations (left/right, front/behind) together with viewpoint-independent relations (above/under). Using this benchmark, we evaluate several existing 3D large multimodal models in a zero-shot setting and find that current models struggle with viewpoint-dependent spatial instructions. We further study how explicit viewpoint information can be incorporated into 3D large multimodal models. We introduce a viewpoint representation that encodes camera poses and conditions the model on the observation viewpoint, improving segmentation accuracy on viewpoint-dependent relations and increasing mIoU from 0.30 to 0.47 compared to a model without viewpoint conditioning. The dataset, code, and trained models will be made publicly available upon acceptance.
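The abstract says that camera poses are used to annotate observer-centric relations (left/right, front/behind) alongside viewpoint-independent ones (above/under), but it gives no algorithmic details. The following is only a minimal sketch of how such labels could be derived from object centroids. It assumes a 4x4 camera-to-world pose, an OpenCV-style camera frame (x right, y down, z forward), a world frame with z pointing up, and a distance margin; the names `look_at_pose`, `spatial_relations`, and `margin` are hypothetical and do not come from the paper.

```python
import numpy as np


def look_at_pose(position, forward, world_up=(0.0, 0.0, 1.0)):
    """Build a hypothetical 4x4 camera-to-world pose (x right, y down, z forward)."""
    forward = np.asarray(forward, dtype=float)
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(world_up, dtype=float))
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)
    pose = np.eye(4)
    pose[:3, 0] = right
    pose[:3, 1] = down
    pose[:3, 2] = forward
    pose[:3, 3] = np.asarray(position, dtype=float)
    return pose


def world_to_camera(points_world, cam_to_world):
    """Map Nx3 world points into the camera frame: R^T (p - t) in row-vector form."""
    rotation = cam_to_world[:3, :3]
    translation = cam_to_world[:3, 3]
    return (np.asarray(points_world, dtype=float) - translation) @ rotation


def spatial_relations(target_world, anchor_world, cam_to_world, margin=0.1):
    """Return the set of relations of the target with respect to the anchor."""
    target_cam, anchor_cam = world_to_camera([target_world, anchor_world], cam_to_world)
    relations = set()

    # Observer-centric: compare camera-frame x (left/right) and depth (front/behind).
    dx = target_cam[0] - anchor_cam[0]
    if dx < -margin:
        relations.add("left")
    elif dx > margin:
        relations.add("right")

    depth_diff = target_cam[2] - anchor_cam[2]
    if depth_diff < -margin:
        relations.add("front")   # target is closer to the observer
    elif depth_diff > margin:
        relations.add("behind")  # target is farther from the observer

    # Viewpoint-independent: compare world-frame height (assumes world z is up).
    dz = float(target_world[2]) - float(anchor_world[2])
    if dz > margin:
        relations.add("above")
    elif dz < -margin:
        relations.add("under")
    return relations
```

With this sketch, the same pair of objects receives opposite left/right and front/behind labels when the camera is moved to the other side of the scene, while an above/under label stays the same. That is the ambiguity the abstract attributes to methods that do not represent the observer viewpoint.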
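The abstract reports results as mIoU (0.30 without viewpoint conditioning, 0.47 with it). As a reminder of what that metric measures, here is a minimal sketch that averages per-sample intersection-over-union between predicted and ground-truth binary point masks. The paper's exact evaluation protocol is not given in the abstract, so the per-sample averaging and the handling of empty masks below are assumptions, and `mean_iou` is a hypothetical helper name.

```python
import numpy as np


def mean_iou(pred_masks, gt_masks):
    """Average IoU over paired binary masks (one pair per referring sample)."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            # Assumption: two empty masks count as a perfect match.
            ious.append(1.0)
            continue
        intersection = np.logical_and(pred, gt).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))
```

Each mask here is a per-point boolean array over the scene's point cloud, so samples from scenes of different sizes can be mixed in one call.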
