Fugu-MT Paper Translation (Abstract): GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

Paper overview: GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

arxiv url: http://arxiv.org/abs/2603.17993v1
Date: Wed, 18 Mar 2026 17:54:35 GMT
Status: Translation complete
System update date: 2026-03-19 18:32:57.87057
Title: GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes
Title (reference translation): GMT: A Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes
Authors: Huajian Zeng, Abhishek Saroha, Daniel Cremers, Xi Wang
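Abstract summary: GMT is a multimodal transformer framework that generates realistic, goal-directed object trajectories. Experiments on synthetic and real-world benchmarks show that GMT outperforms state-of-the-art human motion and human-object interaction baselines.
Reference score (independently computed attention score): 47.88691731631585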
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/
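Abstract (reference translation): Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for robots to interact with complex scenes, but it remains difficult because it requires accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, which limits their ability to capture full scene geometry and constrains trajectory precision. GMT is a multimodal transformer framework that generates realistic, goal-directed object trajectories by jointly using 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and uses a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks show that GMT outperforms state-of-the-art human motion and human-object interaction baselines such as CHOIS and GIMO, with substantial gains in spatial accuracy and orientation control. The proposed method establishes a new benchmark for learning-based manipulation planning and generalizes strongly to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/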

Related papers