PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
- URL: http://arxiv.org/abs/2407.04538v3
- Date: Mon, 22 Jul 2024 09:41:39 GMT
- Title: PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
- Authors: Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
- Abstract summary: We show that pre-trained transformer-based vision models, such as self-supervised DINOv2 ViT, enable the relaxation of constraints.
In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work.
- Score: 7.4774909520731425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts: they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as the self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and the downstream classification task, showing that the strong inductive biases in self-supervised ViT models require rethinking the geometric priors that can be used for unsupervised part discovery.
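To make the TV prior concrete: it penalizes the spatial gradient of the soft part-assignment maps, favoring piecewise-constant regions while, unlike a compactness prior, still permitting multiple connected components of any size. A minimal NumPy sketch of an anisotropic TV penalty on such maps (illustrative only; the array shape and function name are assumptions, not the authors' implementation):

```python
import numpy as np

def total_variation_prior(part_maps: np.ndarray) -> float:
    """Anisotropic total variation of soft part-assignment maps.

    part_maps: array of shape (K, H, W) holding per-pixel part
    probabilities (e.g., a softmax over K parts). The penalty sums the
    absolute differences between vertically and horizontally adjacent
    pixels, so it is zero for constant maps and grows with the total
    length of part boundaries, regardless of how many components a
    part has or how large they are.
    """
    dh = np.abs(np.diff(part_maps, axis=1)).sum()  # vertical neighbors
    dw = np.abs(np.diff(part_maps, axis=2)).sum()  # horizontal neighbors
    return float(dh + dw)
```

For example, a uniform map incurs zero penalty, while a map split in half by one sharp boundary pays a cost proportional to the boundary length; in training, such a term would be added to the classification loss with a weighting coefficient.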
Related papers
- ED-ViT: Splitting Vision Transformer for Distributed Inference on Edge Devices [13.533267828812455]
We propose a novel Vision Transformer splitting framework, ED-ViT, to execute complex models across multiple edge devices efficiently.
Specifically, we partition Vision Transformer models into several sub-models, where each sub-model is tailored to handle a specific subset of data classes.
We conduct extensive experiments on five datasets with three model structures, demonstrating that our approach significantly reduces inference latency on edge devices.
arXiv Detail & Related papers (2024-10-15T14:38:14Z) - Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture [58.60915132222421]
We introduce an approach that is both general and parameter-efficient for face forgery detection.
We design a forgery-style mixture formulation that augments the diversity of forgery source domains.
We show that the designed model achieves state-of-the-art generalizability with significantly reduced trainable parameters.
arXiv Detail & Related papers (2024-08-23T01:53:36Z) - Geometric Features Enhanced Human-Object Interaction Detection [11.513009304308724]
We propose a novel end-to-end Transformer-style HOI detection model, i.e., the geometric features enhanced HOI detector (GeoHOI).
One key part of the model is a new unified self-supervised keypoint learning method named UniPointNet.
GeoHOI effectively upgrades a Transformer-based HOI detector, benefiting from keypoint similarities that measure the likelihood of human-object interactions.
arXiv Detail & Related papers (2024-06-26T18:52:53Z) - Split-and-Fit: Learning B-Reps via Structure-Aware Voronoi Partitioning [50.684254969269546]
We introduce a novel method for acquiring boundary representations (B-Reps) of 3D CAD models.
We apply a spatial partitioning to derive a single primitive within each partition.
We show that our network, coined NVD-Net for neural Voronoi diagrams, can effectively learn Voronoi partitions for CAD models from training data.
arXiv Detail & Related papers (2024-06-07T21:07:49Z) - MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection [54.545054873239295]
Deepfakes have recently raised significant trust issues and security concerns among the public.
ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance.
This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
arXiv Detail & Related papers (2024-04-12T13:02:08Z) - Engineering the Neural Collapse Geometry of Supervised-Contrastive Loss [28.529476019629097]
Supervised-contrastive loss (SCL) is an alternative to cross-entropy (CE) for classification tasks.
We propose methods to engineer the geometry of learnt feature embeddings by modifying the contrastive loss.
arXiv Detail & Related papers (2023-10-02T04:23:17Z) - PDiscoNet: Semantically consistent part discovery for fine-grained recognition [62.12602920807109]
We propose PDiscoNet to discover object parts using only image-level class labels, along with priors encouraging the discovered parts to be compact and consistent across images.
Our results on CUB, CelebA, and PartImageNet show that the proposed method provides substantially better part discovery performance than previous methods.
arXiv Detail & Related papers (2023-09-06T17:19:29Z) - Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer [41.44769642537572]
Unary-Pairwise Transformer is a two-stage detector that exploits unary and pairwise representations for HOIs.
We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches.
arXiv Detail & Related papers (2021-12-03T10:52:06Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers by directly translating the image feature map into the object detection result.
The recent transformer-based image recognition model ViT shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training [3.8073142980733]
We propose a novel framework for monocular 3D object detection using only RGB images, called KM3D-Net.
We design a fully convolutional model to predict object keypoints, dimensions, and orientation, and then combine these estimates with perspective geometry constraints to compute the position attribute.
arXiv Detail & Related papers (2020-09-02T00:51:51Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
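The set-based loss in the DETR entry above relies on a bipartite matching between predictions and ground-truth objects before any loss is computed. A toy sketch of that idea (hypothetical names and a simplified cost of negative class probability; DETR itself uses the Hungarian algorithm with richer class-and-box costs, while this brute-forces permutations for tiny sets):

```python
from itertools import permutations
import math

def set_matching_loss(pred_scores, target_classes):
    """Toy set-prediction loss in the spirit of DETR's bipartite matching.

    pred_scores: list of N dicts mapping class name -> predicted
    probability, one per query. target_classes: list of N ground-truth
    classes, padded with 'no_object' so both sets have equal size.
    The matching cost of pairing a prediction with a target is the
    negative probability it assigns to that target's class; we search
    all permutations for the minimum-cost one-to-one assignment.
    """
    n = len(target_classes)
    best_cost = math.inf
    for perm in permutations(range(n)):
        cost = sum(-pred_scores[i].get(target_classes[perm[i]], 0.0)
                   for i in range(n))
        best_cost = min(best_cost, cost)  # keep the cheapest assignment
    return best_cost
```

Because each ground-truth object is matched to exactly one prediction, duplicate detections are penalized by construction, which is what removes the need for hand-designed components such as non-maximum suppression.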
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.