Fugu-MT Paper Translation (Abstract): Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Paper overview: Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

arxiv url: http://arxiv.org/abs/2603.07660v1
Date: Sun, 08 Mar 2026 14:49:20 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.008102
Title: Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
Title (reference translation): Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
Authors: Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, Zhihang Zhong
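Abstract summary: Holi-Spatial is the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention. Holi-Spatial-4M is the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs.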
Reference score (independently computed attention metric): 78.1406635199656
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention, using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial Question-Answer (QA) pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. Holi-Spatial demonstrates exceptional performance in data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks using this dataset has also led to substantial improvements in model performance.
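Abstract (reference translation): The pursuit of spatial intelligence fundamentally depends on access to large-scale, fine-grained 3D data. However, existing approaches mainly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, scalability is severely constrained, and model performance is further hindered by the domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions to object-level and relational semantic annotations, together with corresponding spatial question-answer (QA) pairs. We further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs. Holi-Spatial shows exceptional performance in data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning vision-language models (VLMs) on spatial reasoning tasks using this dataset also substantially improved model performance.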


Related papers