Fugu-MT Paper Translation (Abstract): Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning

Paper summary: Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning

arxiv url: http://arxiv.org/abs/2603.06982v1
Date: Sat, 07 Mar 2026 01:54:35 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:13.518368
Title: Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning
Title (reference translation): Multi-Modal Model Optimization for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning
Authors: Paul Julius Kühn, Cedric Spengler, Michael Weinmann, Arjan Kuijper, Saptarshi Neil Sinha
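Abstract summary: Image-based shape retrieval (IBSR) aims to retrieve 3D models from a database given a query image. We address IBSR through large-scale multi-modal pretraining and show that explicit view-based supervision is not required.
Reference score (independently computed attention level): 8.222080530754223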
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Image-based shape retrieval (IBSR) aims to retrieve 3D models from a database given a query image, hence addressing a classical task in computer vision, computer graphics, and robotics. Recent approaches typically rely on bridging the domain gap between 2D images and 3D shapes based on the use of multi-view renderings as well as task-specific metric learning to embed shapes and images into a common latent space. In contrast, we address IBSR through large-scale multi-modal pretraining and show that explicit view-based supervision is not required. Inspired by pre-aligned image--point-cloud encoders from ULIP and OpenShape that have been used for tasks such as 3D shape classification, we propose the use of pre-aligned image and shape encoders for zero-shot and standard IBSR by embedding images and point clouds into a shared representation space and performing retrieval via similarity search over compact single-embedding shape descriptors. This formulation allows skipping view synthesis and naturally enables zero-shot and cross-domain retrieval without retraining on the target database. We evaluate pre-aligned encoders in both zero-shot and supervised IBSR settings and additionally introduce a multi-modal hard contrastive loss (HCL) to further increase retrieval performance. Our evaluation demonstrates state-of-the-art performance, outperforming related methods on $Acc_{Top1}$ and $Acc_{Top10}$ for shape retrieval across multiple datasets, with best results observed for OpenShape combined with Point-BERT. Furthermore, training on our proposed multi-modal HCL yields dataset-dependent gains in standard instance retrieval tasks on shape-centric data, underscoring the value of pretraining and hard contrastive learning for 3D shape retrieval. The code will be made available via the project website.
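Abstract (reference translation): Image-based shape retrieval (IBSR) aims to retrieve 3D models from a database given a query image, addressing a classical task in computer vision, computer graphics, and robotics. Recent approaches typically bridge the domain gap between 2D images and 3D shapes using multi-view renderings and task-specific metric learning. In contrast, we address IBSR through large-scale multi-modal pretraining and show that explicit view-based supervision is not required. Inspired by the pre-aligned image and point-cloud encoders of ULIP and OpenShape, which have been used for tasks such as 3D shape classification, we propose using pre-aligned image and shape encoders for zero-shot and standard IBSR: images and point clouds are embedded into a shared representation space, and retrieval is performed by similarity search over compact single-embedding shape descriptors. This formulation makes it possible to skip view synthesis and naturally enables zero-shot and cross-domain retrieval without retraining on the target database. We evaluate pre-aligned encoders in both zero-shot and supervised IBSR settings and additionally introduce a multi-modal hard contrastive loss (HCL) to further improve retrieval performance. Our evaluation demonstrates state-of-the-art performance, outperforming related methods on $Acc_{Top1}$ and $Acc_{Top10}$ across multiple datasets, with the best results for OpenShape combined with Point-BERT. Furthermore, training with the proposed multi-modal HCL yields dataset-dependent gains on standard instance retrieval tasks on shape-centric data, underscoring the value of pretraining and hard contrastive learning for 3D shape retrieval. The code will be made available via the project website.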
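The retrieval step described in the abstract (embed images and shapes into a shared space, then run a similarity search over one embedding per shape) and the top-k accuracy metric can be illustrated with a minimal sketch. This is not the authors' code: the function names `l2_normalize`, `retrieve_top_k`, and `top_k_accuracy` are hypothetical, cosine similarity is assumed as the similarity measure, and the toy vectors stand in for real encoder outputs such as those from OpenShape or ULIP.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def retrieve_top_k(query_emb, shape_embs, k):
    """Return indices of the k shapes most similar to the query embedding."""
    q = l2_normalize(query_emb)
    scored = []
    for idx, emb in enumerate(shape_embs):
        s = l2_normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, s)), idx))
    # Sort by descending similarity; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def top_k_accuracy(query_embs, shape_embs, true_indices, k):
    """Fraction of queries whose ground-truth shape is among the top-k results."""
    hits = sum(
        1
        for q, target in zip(query_embs, true_indices)
        if target in retrieve_top_k(q, shape_embs, k)
    )
    return hits / len(query_embs)


# Toy data (made-up numbers): one embedding per shape, one per image query.
shapes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.6, 0.0, 0.5]]
truth = [0, 1, 2]  # index of the correct shape for each query

print("Top-1 accuracy:", top_k_accuracy(queries, shapes, truth, k=1))
print("Top-2 accuracy:", top_k_accuracy(queries, shapes, truth, k=2))
```

On this toy data the third query ranks shape 0 above its true shape 2, so two of three queries are correct at k=1 and all three at k=2. Because each shape is stored as a single vector, the database can be embedded once and reused for every query, which is what makes retrieval without view synthesis practical in this formulation.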

Related papers