Related papers: Where's Waldo: Diffusion Features for Personalized Segmentation and Retrieval

URL: http://arxiv.org/abs/2405.18025v2
Date: Mon, 30 Sep 2024 12:50:13 GMT
Title: Where's Waldo: Diffusion Features for Personalized Segmentation and Retrieval
Authors: Dvir Samuel, Rami Ben-Ari, Matan Levy, Nir Darshan, Gal Chechik
Abstract summary: Self-supervised foundation models have recently been applied to personalized retrieval and segmentation, showing results comparable to supervised methods. However, these models have a significant flaw: they struggle to locate a desired instance when other instances of the same class are present. We propose a novel approach, PDM (Personalized Features Diffusion Matching), that leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training.
Score: 31.48981364573974
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Personalized retrieval and segmentation aim to locate specific instances within a dataset based on an input image and a short description of the reference instance. While supervised methods are effective, they require extensive labeled data for training. Recently, self-supervised foundation models have been introduced to these tasks showing comparable results to supervised methods. However, a significant flaw in these models is evident: they struggle to locate a desired instance when other instances within the same class are presented. In this paper, we explore text-to-image diffusion models for these tasks. Specifically, we propose a novel approach called PDM for Personalized Features Diffusion Matching, that leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training. PDM demonstrates superior performance on popular retrieval and segmentation benchmarks, outperforming even supervised methods. We also highlight notable shortcomings in current instance and segmentation datasets and propose new benchmarks for these tasks.

Related papers

Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts [13.21626568246313]
We analyze whether vision-language foundation models can be adapted to target datasets with very different distributions and classes. We propose a novel prompt-tuning method, PromptMargin, for adapting such large-scale VLMs directly on the few target samples. PromptMargin effectively tunes the text as well as visual prompts for this task, and has two main modules.
arXiv Detail & Related papers (2025-05-21T13:26:56Z)
Tuning Vision Foundation Model via Test-Time Prompt-Guided Training for VFSS Segmentations [1.8142185304787555]
We propose a novel test-time training paradigm that enhances the performance of foundation models on downstream datasets without requiring full annotations. Specifically, our method employs simple point prompts to guide a test-time semi-self-supervised training task. This approach directly tackles challenges in the medical imaging field, where acquiring annotations is both time-intensive and expensive.
arXiv Detail & Related papers (2025-01-30T16:48:02Z)
Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models [4.157013247909771]
We propose to leverage the recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer). We aim to develop a cost-effective labeling approach to obtain pseudo-labels for semantic segmentation and object instance detection in indoor environments. We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset.
arXiv Detail & Related papers (2023-11-17T21:58:26Z)
Shatter and Gather: Learning Referring Image Segmentation with Text Supervision [52.46081425504072]
We present a new model that discovers semantic entities in an input image and then combines the entities relevant to the text query to predict the mask of the referent. Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed the existing method for the same task and recent open-vocabulary segmentation models on all the benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z)
Semi-Supervised Learning for hyperspectral images by non parametrically predicting view assignment [25.198550162904713]
Hyperspectral image (HSI) classification is currently gaining considerable momentum because of the high inherent spectral information within the images. Recently, to effectively train deep learning models with minimal labelled samples, unlabeled samples are also being leveraged in self-supervised and semi-supervised settings. In this work, we leverage the idea of semi-supervised learning to assist the discriminative self-supervised pretraining of the models.
arXiv Detail & Related papers (2023-06-19T14:13:56Z)
ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP). ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective. We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems. Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts. We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data. We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large scale instance-level annotations for training. We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)
DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning [83.48587570246231]
Visual Similarity plays an important role in many computer vision applications. Deep metric learning (DML) is a powerful framework for learning such similarities. We propose and study multiple complementary learning tasks, targeting conceptually different data relationships. We learn a single model to aggregate their training signals, resulting in strong generalization and state-of-the-art performance.
arXiv Detail & Related papers (2020-04-28T12:26:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.