Fugu-MT paper translation (abstract): REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

Paper summary: REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

arxiv url: http://arxiv.org/abs/2510.16410v1
Date: Sat, 18 Oct 2025 08:53:08 GMT
Status: translation complete
Last updated in system: 2025-10-25 00:56:38.98479
Title: REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
Title (reference translation): REALM: An MLLM-Agent Framework for Open-World 3D Reasoning Segmentation and Editing on Gaussian Splatting
Authors: Changyue Shi, Minghao Chen, Yiping Mao, Chuxiao Yang, Xinyuan Hu, Jiajun Ding, Zhou Yu,
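Abstract summary: Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions. This paper introduces REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation. The framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer.
Reference score (independently computed attention score): 16.896443736904356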
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility. Project page: https://ChangyueShi.github.io/REALM.
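Abstract (reference translation): Bridging the gap between complex human instructions and precise 3D object grounding is an important challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation. We perform segmentation directly on 3D Gaussian Splatting representations, taking advantage of their ability to render photorealistic novel views that are well suited to MLLM comprehension. Because feeding one or more rendered views directly to the MLLM can make the result highly sensitive to viewpoint selection, we propose a new Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are fed into the MLLM agent in parallel for coarse-level localization, and the responses are aggregated to robustly identify the target object. Several close-up novel views of the object are then synthesized for fine-grained local segmentation, producing accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions on LERF, 3D-OVS, and the newly introduced REALM3D benchmark. In addition, the agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practicality and versatility. Project page: https://ChangyueShi.github.io/REALM.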
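The coarse-level step described in the abstract (several global views queried in parallel, with the responses aggregated) could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how responses are aggregated, so majority voting is swapped in here as one plausible rule, and the names aggregate_global_views and query_view are hypothetical, not taken from the paper or its code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence


def aggregate_global_views(
    views: Sequence[str],
    query_view: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Query every global view in parallel and return the most frequent answer.

    `views` holds identifiers of rendered global views (hypothetical).
    `query_view` stands in for one MLLM call on one view; it returns an
    object label, or None when the target is not visible in that view.
    Majority voting is an assumption, not the paper's stated rule.
    """
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(query_view, views))

    # Ignore views in which the model found no target.
    votes = Counter(answer for answer in answers if answer is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Stubbed responses in place of a real MLLM, for illustration only.
stub_responses = {"view_0": "mug", "view_1": "mug", "view_2": "bowl", "view_3": None}
target = aggregate_global_views(list(stub_responses), stub_responses.get)
```

With the stubbed responses above, a single view that mislabels the object is outvoted by the other views, which is the kind of robustness to viewpoint selection the abstract is aiming for.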

Related papers