Fugu-MT Paper Translation (Summary): SceneForge: Structured World Supervision from 3D Interventions

Paper overview: SceneForge: Structured World Supervision from 3D Interventions

arxiv url: http://arxiv.org/abs/2605.14399v1
Date: Thu, 14 May 2026 05:38:00 GMT
Status: Translation complete
System update date: 2026-05-15 21:45:34.641911
Title: SceneForge: Structured World Supervision from 3D Interventions
Authors: Jizhizi Li, Jiayang Ao, Danny Wicks, Petru-Daniel Tudosiu,
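Abstract summary: Multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states.
Reference score (independently calculated attention level): 5.973748478214713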
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene removal performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.
