Fine-Tuned but Zero-Shot 3D Shape Sketch View Similarity and Retrieval
- URL: http://arxiv.org/abs/2306.08541v2
- Date: Thu, 27 Jul 2023 10:07:14 GMT
- Title: Fine-Tuned but Zero-Shot 3D Shape Sketch View Similarity and Retrieval
- Authors: Gianluca Berardi and Yulia Gryaditskaya
- Abstract summary: We show that in a zero-shot setting, the more abstract the sketch, the higher the likelihood of incorrect image matches.
One of the key findings of our research is that meticulous fine-tuning on one class of 3D shapes can lead to improved performance on other shape classes.
- Score: 8.540349872620993
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, encoders like ViT (vision transformer) and ResNet have been trained
on vast datasets and utilized as perceptual metrics for comparing sketches and
images, as well as multi-domain encoders in a zero-shot setting. However, there
has been limited effort to quantify the granularity of these encoders. Our work
addresses this gap by focusing on multi-modal 2D projections of individual 3D
instances. This task holds crucial implications for retrieval and sketch-based
modeling. We show that in a zero-shot setting, the more abstract the sketch,
the higher the likelihood of incorrect image matches. Even within the same
sketch domain, sketches of the same object drawn in different styles, for
example by distinct individuals, might not be accurately matched. One of the
key findings of our research is that meticulous fine-tuning on one class of 3D
shapes can lead to improved performance on other shape classes, reaching or
surpassing the accuracy of supervised methods. We compare and discuss several
fine-tuning strategies. Additionally, we delve deeply into how the scale of an
object in a sketch influences the similarity of features at different network
layers, helping us identify which network layers provide the most accurate
matching. Significantly, we discover that ViT and ResNet perform best when
dealing with similar object scales. We believe that our work will have a
significant impact on research in the sketch domain, providing insights and
guidance on how to adopt large pretrained models as perceptual losses.
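As a concrete illustration of the zero-shot setting described in the abstract, the sketch below ranks rendered views of 3D shapes against a query sketch by cosine similarity of features from a pretrained encoder. It uses an ImageNet-pretrained torchvision ResNet-50 purely as an example backbone; the exact encoder, layer choice, and preprocessing used by the authors are not specified here and should be treated as assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Example backbone: an ImageNet-pretrained ResNet-50 used as a frozen feature
# extractor (the paper also studies ViT; this choice is illustrative only).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
encoder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()  # drop the classifier head

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    """Return an L2-normalised global feature for one image or sketch."""
    img = Image.open(path).convert("RGB")
    feat = encoder(preprocess(img).unsqueeze(0)).flatten(1)
    return F.normalize(feat, dim=1)

def retrieve(sketch_path: str, view_paths: list[str]) -> list[tuple[str, float]]:
    """Rank rendered shape views by cosine similarity to the query sketch."""
    query = embed(sketch_path)
    gallery = torch.cat([embed(p) for p in view_paths], dim=0)
    sims = (gallery @ query.T).squeeze(1)  # cosine similarity (features are unit norm)
    order = sims.argsort(descending=True)
    return [(view_paths[i], sims[i].item()) for i in order]
```

To probe the paper's observation that matching quality depends on the network layer and on object scale, one could instead take features from intermediate blocks (e.g., via forward hooks) and normalise the object's scale in the image before comparison.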
Related papers
- Inverse Neural Rendering for Explainable Multi-Object Tracking [35.072142773300655]
We recast 3D multi-object tracking from RGB cameras as an Inverse Rendering (IR) problem.
We optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties.
We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data.
arXiv Detail & Related papers (2024-04-18T17:37:53Z)
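A minimal sketch of the optimisation loop implied by the entry above: gradient descent on a latent code of a pretrained generative model so that its rendering matches an observed image. The `generator` and `render` callables, the `latent_dim` attribute, and the plain photometric loss are placeholders, not the authors' actual models.

```python
import torch

def fit_latent(observed, generator, render, steps=200, lr=5e-2):
    """Optimise a shape/appearance latent so the rendered object matches `observed`.

    `generator` maps a latent vector to object parameters and `render` produces
    an image tensor from them; both are hypothetical stand-ins here.
    """
    z = torch.zeros(1, generator.latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image = render(generator(z))  # differentiable rendering of the current estimate
        loss = torch.nn.functional.mse_loss(image, observed)  # simple photometric image loss
        loss.backward()
        opt.step()
    return z.detach()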
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
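The voting-based fusion in the entry above can be summarised as: each 2D model predicts a semantic mask, the masks are back-projected onto the 3D points, and every point takes the majority label. Below is a minimal NumPy sketch under the assumption that per-model point labels have already been obtained by projection; the function and argument names are illustrative.

```python
import numpy as np

def fuse_labels_by_voting(per_model_labels: np.ndarray, ignore_label: int = -1) -> np.ndarray:
    """Majority-vote fusion of point-wise semantic labels.

    per_model_labels: (num_models, num_points) integer array, where each row holds
    the labels one 2D vision model assigned to the 3D points after back-projection;
    `ignore_label` marks points a model did not cover.
    Returns a (num_points,) array of fused pseudo labels.
    """
    num_models, num_points = per_model_labels.shape
    fused = np.full(num_points, ignore_label, dtype=np.int64)
    for p in range(num_points):
        votes = per_model_labels[:, p]
        votes = votes[votes != ignore_label]
        if votes.size:
            values, counts = np.unique(votes, return_counts=True)
            fused[p] = values[np.argmax(counts)]
    return fused
```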
- Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats [80.12253291709673]
We propose a novel affine-combining autoencoder (ACAE) method to perform dimensionality reduction on the number of landmarks.
Our approach scales to an extreme multi-dataset regime, where we use 28 3D human pose datasets to supervise one model.
arXiv Detail & Related papers (2022-12-29T22:22:49Z)
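A compact sketch of the affine-combining idea from the entry above: both the encoder and the decoder express their outputs as weighted combinations of the input landmarks with weights summing to one, which keeps the reduced representation consistent under translation and scaling. The dimensions are arbitrary, and the softmax normalisation used here yields convex rather than general affine combinations; both are simplifying assumptions.

```python
import torch
import torch.nn as nn

class AffineCombiningAutoencoder(nn.Module):
    """Reduce P landmarks to L latent points via learned normalised combinations."""

    def __init__(self, num_points: int = 100, num_latent: int = 32):
        super().__init__()
        # Unnormalised weights; softmax over the point axis makes each latent
        # (and each reconstructed) point a weighted combination of the inputs.
        self.enc_weights = nn.Parameter(torch.randn(num_latent, num_points))
        self.dec_weights = nn.Parameter(torch.randn(num_points, num_latent))

    def forward(self, points: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # points: (batch, num_points, 3)
        w_enc = self.enc_weights.softmax(dim=1)  # rows sum to 1
        w_dec = self.dec_weights.softmax(dim=1)
        latent = w_enc @ points                  # (batch, num_latent, 3)
        recon = w_dec @ latent                   # (batch, num_points, 3)
        return latent, recon

# Training would minimise e.g. an L1 reconstruction loss over landmarks pooled
# from many skeleton formats, so that one latent set bridges all of them.
```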
- Structure-Aware 3D VR Sketch to 3D Shape Retrieval [113.20120789493217]
We focus on the challenge caused by inherent inaccuracies in 3D VR sketches.
We propose to use a triplet loss with an adaptive margin value driven by a "fitting gap".
We introduce a dataset of 202 VR sketches for 202 3D shapes drawn from memory rather than from observation.
arXiv Detail & Related papers (2022-09-19T14:29:26Z)
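A minimal PyTorch sketch of a triplet loss whose margin is modulated by a per-sample "fitting gap", as described in the entry above. How the gap is computed, the scaling constant, and the direction of the adaptation (whether a larger gap relaxes or tightens the margin) are assumptions here, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_triplet_loss(anchor, positive, negative, fitting_gap,
                                 base_margin=0.3, scale=1.0):
    """Triplet loss with a per-sample margin driven by a fitting gap.

    anchor, positive, negative: (batch, dim) embeddings of sketch, matching shape,
    and non-matching shape.
    fitting_gap: (batch,) non-negative measure of how loosely the sketch fits its
    ground-truth shape; here it simply enlarges the margin (one plausible choice).
    """
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    margin = base_margin + scale * fitting_gap
    return F.relu(d_pos - d_neg + margin).mean()
```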
- MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views, and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
arXiv Detail & Related papers (2022-08-18T00:48:15Z)
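A simplified InfoNCE-style loss for the dense correspondence setup in the entry above: per-pixel features from two renderings of the same shape are pulled together at corresponding locations and contrasted against all other pixels. The feature shapes, index format, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dense_correspondence_loss(feat_a, feat_b, idx_a, idx_b, temperature=0.07):
    """Contrastive loss over corresponding pixels of two rendered views.

    feat_a, feat_b: (C, H, W) per-pixel features of view A and view B.
    idx_a, idx_b: (N,) long tensors of flat pixel indices such that pixel
    idx_a[i] in view A corresponds to pixel idx_b[i] in view B (known from
    the shared 3D geometry of the renderings).
    """
    fa = F.normalize(feat_a.flatten(1).T, dim=1)  # (H*W, C)
    fb = F.normalize(feat_b.flatten(1).T, dim=1)
    q = fa[idx_a]                                 # (N, C) anchors from view A
    logits = q @ fb.T / temperature               # contrast against every pixel of view B
    return F.cross_entropy(logits, idx_b)         # positive column = corresponding pixel
```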
- A Closer Look at Invariances in Self-supervised Pre-training for 3D Vision [0.0]
Self-supervised pre-training for 3D vision has drawn increasing research interest in recent years.
We present a unified framework under which various pre-training methods can be investigated.
We propose a simple but effective method that jointly pre-trains a 3D encoder and a depth map encoder using contrastive learning.
arXiv Detail & Related papers (2022-07-11T16:44:15Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications.
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
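One way to read the pairwise alignment in the entry above is as a distillation loss that pushes each student point descriptor towards the frozen teacher's descriptor for the corresponding region of the rendered views. The one-to-one correspondence and the choice of a cosine alignment term are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def crossmodal_distillation_loss(student_point_feats, teacher_view_feats):
    """Align student (point cloud) descriptors with paired teacher (image) descriptors.

    student_point_feats: (batch, N, D) descriptors from the trainable point encoder.
    teacher_view_feats:  (batch, N, D) descriptors from the frozen image encoder,
    already matched one-to-one to the points (e.g. via projection into the views).
    """
    s = F.normalize(student_point_feats, dim=-1)
    t = F.normalize(teacher_view_feats.detach(), dim=-1)  # teacher stays frozen
    return (1.0 - (s * t).sum(dim=-1)).mean()             # mean (1 - cosine similarity)
```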
- Zero in on Shape: A Generic 2D-3D Instance Similarity Metric learned from Synthetic Data [3.71630298053787]
We present a network architecture which compares RGB images and untextured 3D models by the similarity of the represented shape.
Our system is optimised for zero-shot retrieval, meaning it can recognise shapes never shown in training.
arXiv Detail & Related papers (2021-08-09T14:44:08Z)
- Contrastive Spatial Reasoning on Multi-View Line Drawings [11.102238863932255]
State-of-the-art supervised deep networks show puzzlingly low performance on the SPARE3D dataset.
We propose a simple contrastive learning approach along with other network modifications to improve the baseline performance.
Our approach uses a self-supervised binary classification network to compare the line drawing differences between various views of any two similar 3D objects.
arXiv Detail & Related papers (2021-04-27T19:05:27Z)
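A minimal siamese-style sketch of the self-supervised binary task in the entry above: given line drawings of two objects rendered from various views, a small network predicts whether the drawings depict the same object. The backbone, feature size, and head are assumptions standing in for the authors' architecture.

```python
import torch
import torch.nn as nn

class SameObjectClassifier(nn.Module):
    """Predict whether two line drawings (possibly different views) show the same 3D object."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(                  # tiny CNN stand-in for the real backbone
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.head = nn.Linear(2 * feat_dim, 1)         # logit for "same object"

    def forward(self, drawing_a: torch.Tensor, drawing_b: torch.Tensor) -> torch.Tensor:
        fa, fb = self.encoder(drawing_a), self.encoder(drawing_b)
        return self.head(torch.cat([fa, fb], dim=1)).squeeze(1)

# Trained with nn.BCEWithLogitsLoss on pairs labelled "same shape" / "different shape".
```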
- PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding [107.02479689909164]
In this work, we aim at facilitating research on 3D representation learning.
We measure the effect of unsupervised pre-training on a large source set of 3D scenes.
arXiv Detail & Related papers (2020-07-21T17:59:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and accepts no responsibility for any consequences of its use.