Fugu-MT paper translation (abstract): Object Pose Transformer: Unifying Unseen Object Pose Estimation

Paper abstract: Object Pose Transformer: Unifying Unseen Object Pose Estimation

arxiv url: http://arxiv.org/abs/2603.23370v1
Date: Tue, 24 Mar 2026 16:04:16 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.575224
Title: Object Pose Transformer: Unifying Unseen Object Pose Estimation
Title (reference translation): Object Pose Transformer: Unifying Pose Estimation for Unseen Objects
Authors: Weihang Li, Lorenzo Garattoni, Fabien Despinoy, Nassir Navab, Benjamin Busam
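Abstract summary: Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. The proposed model jointly predicts depth, point maps, camera parameters, and normalized object coordinates from RGB inputs. It is camera-agnostic, learns camera intrinsics on-the-fly, and supports optional depth input for metric-scale recovery.
Reference score (independently computed attention score): 54.20344997573707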
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer, a unified feed-forward framework that bridges these paradigms through task factorization within a single model. Object Pose Transformer jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at inference time, and uses point maps as a camera-space representation to enable multi-view relative geometric reasoning. Through cross-frame feature interaction and shared object embeddings, our model leverages relative geometric consistency across views to improve absolute pose estimation, reducing ambiguity in single-view predictions. Furthermore, Object Pose Transformer is camera-agnostic, learning camera intrinsics on-the-fly and supporting optional depth input for metric-scale recovery, while remaining fully functional in RGB-only settings. Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.
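Abstract (reference translation): Learning model-free object pose estimation for unseen instances is a fundamental challenge in 3D vision. Category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer, a unified feed-forward framework that bridges these paradigms through task factorization within a single model. Object Pose Transformer jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. The proposed method uses point maps as a camera-space representation to enable multi-view relative geometric reasoning. Through cross-frame feature interaction and shared object embeddings, it leverages relative geometric consistency across views to improve absolute pose estimation and reduce ambiguity in single-view predictions. Furthermore, Object Pose Transformer is camera-agnostic, learns camera intrinsics on-the-fly, and supports optional depth input for metric-scale recovery while remaining fully functional in RGB-only settings. Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.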
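
The abstract contrasts absolute pose (object to camera) with relative pose (a cross-view transformation). As a minimal sketch of how the two relate geometrically, the Python below derives the view-1 to view-2 rigid transform from two object-to-camera poses. This is textbook rigid-body algebra, not the paper's method; the function names (`relative_pose`, `apply`, `rot_z`) and the example poses are hypothetical and chosen only for illustration.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    """Transpose a 3x3 matrix (the inverse, for a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def apply(r, t, p):
    """Apply the rigid transform x -> r @ x + t to a 3D point p."""
    return [sum(r[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]


def relative_pose(r1, t1, r2, t2):
    """Return (r_rel, t_rel) mapping view-1 camera coordinates to view-2.

    Assumes x_cam1 = r1 @ x_obj + t1 and x_cam2 = r2 @ x_obj + t2 for the
    same rigid object, so x_cam2 = r_rel @ x_cam1 + t_rel with
    r_rel = r2 @ r1^T and t_rel = t2 - r_rel @ t1.
    """
    r_rel = matmul(r2, transpose(r1))
    t_rel = [t2[i] - sum(r_rel[i][k] * t1[k] for k in range(3))
             for i in range(3)]
    return r_rel, t_rel


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Hypothetical object-to-camera poses for two views of the same object.
r1, t1 = rot_z(math.radians(30)), [0.1, 0.0, 1.0]
r2, t2 = rot_z(math.radians(75)), [-0.2, 0.05, 1.2]

# A point on the object, expressed in each camera frame.
p_obj = [0.1, -0.2, 0.3]
p_cam1 = apply(r1, t1, p_obj)
p_cam2 = apply(r2, t2, p_obj)

# The relative pose carries the view-1 observation onto the view-2 observation.
r_rel, t_rel = relative_pose(r1, t1, r2, t2)
p_mapped = apply(r_rel, t_rel, p_cam1)
print(max(abs(a - b) for a, b in zip(p_mapped, p_cam2)) < 1e-9)
```

If both absolute poses also carry the same uniform object scale, that scale cancels in the derivation, so the relative transform stays rigid.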

Related papers