Fugu-MT paper translation (abstract): FoundPose: Unseen Object Pose Estimation with Foundation Features

Paper summary: FoundPose: Unseen Object Pose Estimation with Foundation Features

arxiv url: http://arxiv.org/abs/2311.18809v2
Date: Fri, 19 Jul 2024 09:33:12 GMT
Status: translation complete
Last updated in system: 2024-07-23 00:16:29.809829
Title: FoundPose: Unseen Object Pose Estimation with Foundation Features
Title (reference translation): FoundPose: Unseen Object Pose Estimation with Foundation Features
Authors: Evin Pınar Örnek, Yann Labbé, Bugra Tekin, Lingni Ma, Cem Keskin, Christian Forster, Tomas Hodan
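Abstract summary: FoundPose is a model-based method for 6D pose estimation of unseen objects from a single RGB image. The method can quickly onboard new objects using their 3D models, without requiring any object- or task-specific training.
Reference score (independently computed attention score): 11.32559845631345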
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose FoundPose, a model-based method for 6D pose estimation of unseen objects from a single RGB image. The method can quickly onboard new objects using their 3D models without requiring any object- or task-specific training. In contrast, existing methods typically pre-train on large-scale, task-specific datasets in order to generalize to new objects and to bridge the image-to-model domain gap. We demonstrate that such generalization capabilities can be observed in a recent vision foundation model trained in a self-supervised manner. Specifically, our method estimates the object pose from image-to-model 2D-3D correspondences, which are established by matching patch descriptors from the recent DINOv2 model between the image and pre-rendered object templates. We find that reliable correspondences can be established by kNN matching of patch descriptors from an intermediate DINOv2 layer. Such descriptors carry stronger positional information than descriptors from the last layer, and we show their importance when semantic information is ambiguous due to object symmetries or a lack of texture. To avoid establishing correspondences against all object templates, we develop an efficient template retrieval approach that integrates the patch descriptors into the bag-of-words representation and can promptly propose a handful of similarly looking templates. Additionally, we apply featuremetric alignment to compensate for discrepancies in the 2D-3D correspondences caused by coarse patch sampling. The resulting method noticeably outperforms existing RGB methods for refinement-free pose estimation on the standard BOP benchmark with seven diverse datasets and can be seamlessly combined with an existing render-and-compare refinement method to achieve RGB-only state-of-the-art results. Project page: evinpinar.github.io/foundpose.
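Abstract (reference translation): We propose FoundPose, a model-based method for 6D pose estimation of unseen objects from a single RGB image. The method can quickly onboard new objects using their 3D models, without requiring any object- or task-specific training. By contrast, existing methods typically pre-train on large-scale, task-specific datasets in order to generalize to new objects and to bridge the image-to-model domain gap. We demonstrate that such generalization capabilities can be observed in a recent vision foundation model trained in a self-supervised manner. Specifically, the method estimates the object pose from image-to-model 2D-3D correspondences, which are established by matching patch descriptors from the recent DINOv2 model between the image and pre-rendered object templates. Reliable correspondences can be established by kNN matching of patch descriptors from an intermediate DINOv2 layer. These descriptors carry stronger positional information than descriptors from the last layer, and we show their importance when semantic information is ambiguous due to object symmetries or a lack of texture. To avoid establishing correspondences against all object templates, we develop an efficient template retrieval approach that integrates the patch descriptors into a bag-of-words representation and can quickly propose a handful of similar-looking templates. In addition, we apply featuremetric alignment to compensate for discrepancies in the 2D-3D correspondences caused by coarse patch sampling. The resulting method noticeably outperforms existing RGB methods for refinement-free pose estimation on the standard BOP benchmark with seven diverse datasets, and it can be seamlessly combined with an existing render-and-compare refinement method to achieve RGB-only state-of-the-art results. Project page: evinpinar.github.io/foundpose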
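
The kNN matching of patch descriptors described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`l2_normalize`, `knn_match`), the use of cosine similarity, the descriptor dimension, and the random stand-in data are all assumptions made for illustration. In the paper the descriptors come from an intermediate DINOv2 layer and the template patches come from pre-rendered object templates.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def knn_match(query_desc, template_desc, k=1):
    """For each query patch, find the k most similar template patches.

    query_desc:    (N_q, D) array of image patch descriptors.
    template_desc: (N_t, D) array of template patch descriptors.
    Returns (idx, sim), both of shape (N_q, k), sorted by decreasing similarity.
    Cosine similarity is an assumption of this sketch, not taken from the paper.
    """
    q = l2_normalize(query_desc)
    t = l2_normalize(template_desc)
    sim_matrix = q @ t.T                           # (N_q, N_t) cosine similarities
    idx = np.argsort(-sim_matrix, axis=1)[:, :k]   # indices of the top-k per row
    sim = np.take_along_axis(sim_matrix, idx, axis=1)
    return idx, sim


# Hypothetical stand-in data: random descriptors instead of DINOv2 features.
rng = np.random.default_rng(0)
template_desc = rng.normal(size=(50, 64))   # 50 template patches, 64-D each
template_xyz = rng.normal(size=(50, 3))     # 3D model point behind each template patch
perm = rng.permutation(50)
query_desc = template_desc[perm] + 0.01 * rng.normal(size=(50, 64))
query_xy = rng.uniform(0, 224, size=(50, 2))  # 2D patch centres in the query image

idx, sim = knn_match(query_desc, template_desc, k=3)

# Pair each 2D query patch centre with the 3D point of its best template match.
points_2d = query_xy
points_3d = template_xyz[idx[:, 0]]
print(points_2d.shape, points_3d.shape)
```

The resulting 2D-3D pairs are the kind of correspondences from which the abstract says the object pose is estimated. The pose solver itself, the template retrieval step, and the featuremetric alignment are outside the scope of this sketch.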


Related papers