Deep ViT Features as Dense Visual Descriptors
- URL: http://arxiv.org/abs/2112.05814v1
- Date: Fri, 10 Dec 2021 20:15:03 GMT
- Title: Deep ViT Features as Dense Visual Descriptors
- Authors: Shir Amir, Yossi Gandelsman, Shai Bagon and Tali Dekel
- Abstract summary: We leverage deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors.
These descriptors facilitate a variety of applications, including co-segmentation, part co-segmentation and correspondences.
- Score: 12.83702462166513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We leverage deep features extracted from a pre-trained Vision Transformer
(ViT) as dense visual descriptors. We demonstrate that such features, when
extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking
properties: (i) the features encode powerful high level information at high
spatial resolution -- i.e., capture semantic object parts at fine spatial
granularity, and (ii) the encoded semantic information is shared across
related, yet different object categories (i.e. super-categories). These
properties allow us to design powerful dense ViT descriptors that facilitate a
variety of applications, including co-segmentation, part co-segmentation and
correspondences -- all achieved by applying lightweight methodologies to deep
ViT features (e.g., binning / clustering). We take these applications further
to the realm of inter-class tasks -- demonstrating how objects from related
categories can be commonly segmented into semantic parts, under significant
pose and appearance changes. Our methods, extensively evaluated qualitatively
and quantitatively, achieve state-of-the-art part co-segmentation results, and
competitive results with recent supervised methods trained specifically for
co-segmentation and correspondences.
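The recipe the abstract describes is deliberately lightweight: take per-patch features from a frozen DINO-ViT, L2-normalize them, and match or cluster them directly. Below is a minimal sketch of the correspondence use case, assuming the public facebookresearch/dino torch.hub entry point; the choice of the last block's token outputs (rather than a specific attention facet or layer the paper may use) is an illustrative assumption, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

# Frozen self-supervised ViT; ViT-S/8 yields a 28x28 patch grid at 224x224 input.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

def dense_descriptors(img):
    """img: (1, 3, H, W), H and W divisible by 8 -> (N, D) unit-norm patch descriptors."""
    with torch.no_grad():
        feats = model.get_intermediate_layers(img, n=1)[0]  # (1, 1+N, D), last block
    return F.normalize(feats[0, 1:, :], dim=-1)             # drop the CLS token

# Correspondences: for each patch in image A, find its most similar patch in B.
img_a = torch.randn(1, 3, 224, 224)   # stand-ins for real, normalized images
img_b = torch.randn(1, 3, 224, 224)
da, db = dense_descriptors(img_a), dense_descriptors(img_b)
sim = da @ db.T                        # (N_a, N_b) cosine similarities
matches = sim.argmax(dim=1)            # best-matching patch in B for each patch in A
```

Binning neighboring patch descriptors before matching, which the abstract mentions alongside clustering, would trade spatial resolution for robustness in the same pipeline.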
Related papers
- Upsampling DINOv2 features for unsupervised vision tasks and weakly supervised materials segmentation [0.0]
Self-supervised vision transformers (ViTs) contain strong semantic and positional information relevant to downstream tasks like object localization and segmentation.
Recent works combine these features with traditional methods like clustering, graph partitioning or region correlations to achieve impressive baselines without finetuning or training additional networks.
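A hedged sketch of that upsample-then-cluster recipe, assuming the public facebookresearch/dinov2 hub entry point and its forward_features output dictionary; the cluster count and plain bilinear upsampling are illustrative choices, not the paper's method.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()

img = torch.randn(1, 3, 224, 224)     # stand-in for a normalized input image
with torch.no_grad():
    toks = model.forward_features(img)['x_norm_patchtokens']  # (1, 256, 384)
h = w = 224 // 14                      # 16x16 patch grid for ViT-S/14
grid = toks.reshape(1, h, w, -1).permute(0, 3, 1, 2)          # (1, D, 16, 16)
up = F.interpolate(grid, size=(224, 224), mode='bilinear')    # dense per-pixel features
pix = up[0].permute(1, 2, 0).reshape(-1, up.shape[1])         # (224*224, D)
labels = KMeans(n_clusters=5, n_init='auto').fit_predict(pix.numpy())
segmentation = labels.reshape(224, 224)                        # unsupervised region map
```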
arXiv Detail & Related papers (2024-10-20T13:01:53Z)
- PSVMA+: Exploring Multi-granularity Semantic-visual Adaption for Generalized Zero-shot Learning [116.33775552866476]
Generalized zero-shot learning (GZSL) endeavors to identify the unseen using knowledge from the seen domain.
GZSL suffers from insufficient visual-semantic correspondences due to attribute diversity and instance diversity.
We propose a multi-granularity progressive semantic-visual adaption network, where sufficient visual elements can be gathered to remedy the inconsistency.
arXiv Detail & Related papers (2024-10-15T12:49:33Z)
- Embedding Generalized Semantic Knowledge into Few-Shot Remote Sensing Segmentation [26.542268630980814]
Few-shot segmentation (FSS) for remote sensing (RS) imagery leverages supporting information from limited annotated samples to achieve query segmentation of novel classes.
Previous efforts are dedicated to mining segmentation-guiding visual cues from a constrained set of support samples.
We propose a holistic semantic embedding (HSE) approach that effectively harnesses general semantic knowledge.
arXiv Detail & Related papers (2024-05-22T14:26:04Z)
- Exploiting Object-based and Segmentation-based Semantic Features for Deep Learning-based Indoor Scene Classification [0.5572976467442564]
The work described in this paper uses semantic information obtained from both object detection and semantic segmentation techniques.
A novel approach is proposed that uses a semantic segmentation mask to provide a Hu-moments-based shape characterization of segmentation categories, designated Segmentation Hu-Moments Features (SHMFs).
A three-branch network, designated GOS²F²App, that exploits deep-learning-based global features, object-based features, and semantic-segmentation-based features is also proposed.
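Hu moments are cheap, scale- and rotation-invariant shape statistics, which is what makes them a reasonable per-category shape descriptor here. A minimal OpenCV sketch, with a log transform added as a common (assumed) normalization rather than the paper's exact choice:

```python
import cv2
import numpy as np

def shape_features(seg_mask, category_id):
    """seg_mask: (H, W) integer label map; returns the 7 Hu moments of one category."""
    binary = (seg_mask == category_id).astype(np.uint8)
    m = cv2.moments(binary, binaryImage=True)
    hu = cv2.HuMoments(m).flatten()    # 7 scale/rotation-invariant shape moments
    # Log-scale for numerical stability, preserving the sign of each moment.
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)

mask = np.zeros((64, 64), np.int32)
mask[16:48, 20:44] = 3                 # toy rectangular region for category 3
print(shape_features(mask, 3))
```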
arXiv Detail & Related papers (2024-04-11T13:37:51Z)
- N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields [112.02885337510716]
Nested Neural Feature Fields (N2F2) is a novel approach that employs hierarchical supervision to learn a single feature field.
We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space.
Our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization.
arXiv Detail & Related papers (2024-03-16T18:50:44Z)
- EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation [5.476136494434766]
We introduce EiCue, a technique providing semantic and structural cues through an eigenbasis derived from a semantic similarity matrix.
We guide our model to learn object-level representations with intra- and inter-image object-feature consistency.
Experiments on the COCO-Stuff, Cityscapes, and Potsdam-3 datasets demonstrate state-of-the-art unsupervised semantic segmentation (USS) results.
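The eigenbasis construction follows the standard spectral pattern: build a similarity graph over patch features, take the low eigenvectors of its Laplacian, and cluster in that basis. A hedged NumPy sketch, with random features standing in for the model's and an unnormalized Laplacian as an illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans

feats = np.random.randn(256, 384)              # stand-in per-patch features
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
S = np.clip(feats @ feats.T, 0, None)          # nonnegative semantic similarity matrix
D = np.diag(S.sum(axis=1))
L = D - S                                      # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
basis = eigvecs[:, 1:6]                        # skip the trivial constant eigenvector
# Cluster patches in the eigenbasis (the classic spectral clustering step).
labels = KMeans(n_clusters=5, n_init='auto').fit_predict(basis)
```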
arXiv Detail & Related papers (2024-03-03T11:24:16Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in deep semantic segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- Framework-agnostic Semantically-aware Global Reasoning for Segmentation [29.69187816377079]
We propose a component that learns to project image features into latent representations and reason between them.
Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint.
Our latent tokens are semantically interpretable and diverse and provide a rich set of features that can be transferred to downstream tasks.
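In broad strokes the component amounts to: soft-assign pixels to a few latent tokens, reason among the tokens, and scatter the result back. A minimal PyTorch sketch under those assumptions; the single attention layer and all sizes are placeholders, and the spatial-disjointness constraint the paper describes is omitted:

```python
import torch
import torch.nn as nn

class LatentReasoning(nn.Module):
    def __init__(self, dim=256, n_tokens=8):
        super().__init__()
        self.assign = nn.Conv2d(dim, n_tokens, kernel_size=1)   # pixel-to-token logits
        self.reason = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        a = self.assign(x).flatten(2).softmax(dim=-1)      # (B, K, HW) spatial weights
        feats = x.flatten(2).transpose(1, 2)               # (B, HW, C)
        tokens = a @ feats                                 # (B, K, C) latent tokens
        tokens, _ = self.reason(tokens, tokens, tokens)    # token-to-token reasoning
        out = a.transpose(1, 2) @ tokens                   # scatter back: (B, HW, C)
        return x + out.transpose(1, 2).reshape(b, c, h, w) # residual fusion

y = LatentReasoning()(torch.randn(2, 256, 32, 32))
```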
arXiv Detail & Related papers (2022-12-06T21:42:05Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without requiring any dense annotations.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
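Once pixel features live in a language-aligned space, inference reduces to scoring them against class-name embeddings. A hedged sketch of that final step; the encoders producing these features are stand-ins, and the paper's contrastive and clustering training is what actually aligns vision and language:

```python
import torch
import torch.nn.functional as F

def zero_shot_segment(pixel_feats, text_embeds):
    """pixel_feats: (D, H, W) dense visual features; text_embeds: (K, D), one per class name."""
    pf = F.normalize(pixel_feats.flatten(1).T, dim=-1)   # (H*W, D)
    te = F.normalize(text_embeds, dim=-1)                # (K, D)
    logits = pf @ te.T                                   # cosine similarity to each class
    return logits.argmax(dim=-1).reshape(pixel_feats.shape[1:])

# Stand-in features and four class-name embeddings of matching dimension.
seg = zero_shot_segment(torch.randn(512, 32, 32), torch.randn(4, 512))
```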
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
- Improving Point Cloud Semantic Segmentation by Learning 3D Object Detection [102.62963605429508]
Point cloud semantic segmentation plays an essential role in autonomous driving.
Current 3D semantic segmentation networks focus on convolutional architectures that perform well on well-represented classes.
We propose a novel Detection Aware 3D Semantic Segmentation (DASS) framework that explicitly leverages localization features from an auxiliary 3D object detection task.
arXiv Detail & Related papers (2020-09-22T14:17:40Z)
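The structure behind such detection-aware segmentation is a shared encoder with a semantic head and an auxiliary detection head, so localization gradients shape the shared features. A minimal PyTorch sketch under assumed layer sizes and a simple center-plus-size box parameterization, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class SharedPointNet(nn.Module):
    def __init__(self, n_classes=20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                     nn.Linear(64, 128), nn.ReLU())
        self.seg_head = nn.Linear(128, n_classes)   # per-point semantic logits
        self.det_head = nn.Linear(128, 6)           # per-point box: center + size

    def forward(self, pts):                         # pts: (B, N, 3)
        f = self.encoder(pts)                       # shared, localization-aware features
        return self.seg_head(f), self.det_head(f)

seg_logits, boxes = SharedPointNet()(torch.randn(2, 1024, 3))
```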
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.