A Bag-of-Prototypes Representation for Dataset-Level Applications
- URL: http://arxiv.org/abs/2303.13251v1
- Date: Thu, 23 Mar 2023 13:33:58 GMT
- Title: A Bag-of-Prototypes Representation for Dataset-Level Applications
- Authors: Weijie Tu, Weijian Deng, Tom Gedeon and Liang Zheng
- Abstract summary: This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty.
We propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag consisting of patch descriptors to a dataset-level bag consisting of semantic prototypes.
BoP consistently shows its advantage over existing representations on a series of benchmarks for two dataset-level tasks.
- Score: 24.629132557336312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work investigates dataset vectorization for two dataset-level tasks:
assessing training set suitability and test set difficulty. The former measures
how suitable a training set is for a target domain, while the latter studies
how challenging a test set is for a learned model. Central to the two tasks is
measuring the underlying relationship between datasets. This needs a desirable
dataset vectorization scheme, which should preserve as much discriminative
dataset information as possible so that the distance between the resulting
dataset vectors can reflect dataset-to-dataset similarity. To this end, we
propose a bag-of-prototypes (BoP) dataset representation that extends the
image-level bag consisting of patch descriptors to a dataset-level bag consisting
of semantic prototypes. Specifically, we develop a codebook consisting of K
prototypes clustered from a reference dataset. Given a dataset to be encoded,
we quantize each of its image features to a certain prototype in the codebook
and obtain a K-dimensional histogram. Without assuming access to dataset
labels, the BoP representation provides a rich characterization of the dataset
semantic distribution. Furthermore, BoP representations cooperate well with
Jensen-Shannon divergence for measuring dataset-to-dataset similarity. Although
very simple, BoP consistently shows its advantage over existing representations
on a series of benchmarks for two dataset-level tasks.
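The BoP pipeline sketched in the abstract (cluster a reference dataset into K prototypes, quantize each image feature to its nearest prototype, and compare the resulting K-dimensional histograms with Jensen-Shannon divergence) can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation: the plain k-means routine, the synthetic Gaussian features, and all function names are assumptions made for the example.

```python
import numpy as np

def build_codebook(reference_features, k, n_iter=50, seed=0):
    """Cluster reference features into K prototypes with plain k-means."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(reference_features), size=k, replace=False)
    prototypes = reference_features[idx].copy()
    for _ in range(n_iter):
        # assign each feature to its nearest prototype, then recompute centroids
        dists = np.linalg.norm(reference_features[:, None] - prototypes[None], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = reference_features[assign == j]
            if len(members):
                prototypes[j] = members.mean(axis=0)
    return prototypes

def bop_histogram(features, prototypes):
    """Quantize each feature to its nearest prototype; return a normalized K-dim histogram."""
    dists = np.linalg.norm(features[:, None] - prototypes[None], axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(prototypes))
    return counts / counts.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two histograms (natural log)."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy usage: two synthetic "datasets" encoded against one shared codebook.
rng = np.random.default_rng(1)
reference = rng.normal(size=(200, 8))
codebook = build_codebook(reference, k=4)
h1 = bop_histogram(rng.normal(size=(100, 8)), codebook)            # similar to reference
h2 = bop_histogram(rng.normal(loc=2.0, size=(100, 8)), codebook)   # shifted distribution
```

Because both datasets are encoded against the same codebook, the histograms are directly comparable, and a larger distribution shift shows up as a larger Jensen-Shannon divergence between them.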
Related papers
- A Unified Manifold Similarity Measure Enhancing Few-Shot, Transfer, and Reinforcement Learning in Manifold-Distributed Datasets [1.2289361708127877]
We present a novel method for determining the similarity between two manifold structures.
This method can be used to determine whether the target and source datasets have a similar manifold structure suitable for transfer learning.
We then present a few-shot learning method to classify manifold-distributed datasets with limited labels.
arXiv Detail & Related papers (2024-08-12T01:25:00Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation for autonomous vehicles [63.20765930558542]
3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization.
We propose a new dataset, Navya 3D Semantic Segmentation (Navya3DSeg), with a diverse label space corresponding to a large-scale, production-grade operational domain.
It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds.
arXiv Detail & Related papers (2023-02-16T13:41:19Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Label-Free Model Evaluation with Semi-Structured Dataset Representations [78.54590197704088]
Label-free model evaluation, or AutoEval, estimates model accuracy on unlabeled test sets.
In the absence of image labels, based on dataset representations, we estimate model performance for AutoEval with regression.
We propose a new semi-structured dataset representation that is manageable for regression learning while containing rich information for AutoEval.
arXiv Detail & Related papers (2021-12-01T18:15:58Z)
- Free Lunch for Co-Saliency Detection: Context Adjustment [14.688461235328306]
We propose a "cost-free" group-cut-paste (GCP) procedure to leverage images from off-the-shelf saliency detection datasets and synthesize new samples.
We collect a novel dataset called Context Adjustment Training. The two variants of our dataset, i.e., CAT and CAT+, consist of 16,750 and 33,500 images, respectively.
arXiv Detail & Related papers (2021-08-04T14:51:37Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Self-supervised Robust Object Detectors from Partially Labelled Datasets [3.1669406516464007]
Merging datasets allows us to train one integrated object detector, instead of training several separate ones.
We propose a training framework to overcome missing-label challenge of the merged datasets.
We evaluate our proposed framework for training Yolo on a simulated merged dataset with a missing rate of approximately 48% using VOC2012 and VOC2007.
arXiv Detail & Related papers (2020-05-23T15:18:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.