Nearest Neighbor Classifiers over Incomplete Information: From Certain
Answers to Certain Predictions
- URL: http://arxiv.org/abs/2005.05117v2
- Date: Tue, 12 May 2020 10:46:33 GMT
- Title: Nearest Neighbor Classifiers over Incomplete Information: From Certain
Answers to Certain Predictions
- Authors: Bojan Karlaš, Peng Li, Renzhi Wu, Nezihe Merve Gürel, Xu Chu,
Wentao Wu, Ce Zhang
- Abstract summary: Inconsistency and incomplete information are ubiquitous in real-world datasets.
We propose the notion of "Certain Predictions" (CP).
CPClean can significantly outperform existing techniques in terms of classification accuracy with mild manual cleaning effort.
- Score: 25.249892204626583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) applications have been thriving recently, largely
attributed to the increasing availability of data. However, inconsistency and
incomplete information are ubiquitous in real-world datasets, and their impact
on ML applications remains elusive. In this paper, we present a formal study of
this impact by extending the notion of Certain Answers for Codd tables, which
has been explored by the database research community for decades, into the
field of machine learning. Specifically, we focus on classification problems
and propose the notion of "Certain Predictions" (CP) -- a test data example can
be certainly predicted (CP'ed) if all possible classifiers trained on top of
all possible worlds induced by the incompleteness of data would yield the same
prediction.
We study two fundamental CP queries: (Q1) checking query that determines
whether a data example can be CP'ed; and (Q2) counting query that computes the
number of classifiers that support a particular prediction (i.e., label). Given
that general solutions to CP queries are, not surprisingly, hard without
assumptions about the type of classifier, we further present a case study in the
context of nearest neighbor (NN) classifiers, where efficient solutions to CP
queries can be developed -- we show that it is possible to answer both queries
in linear or polynomial time over exponentially many possible worlds.
We demonstrate one example use case of CP in the important application of
"data cleaning for machine learning (DC for ML)." We show that our proposed
CPClean approach, built on top of CP, can often significantly outperform existing
techniques in terms of classification accuracy with mild manual cleaning
effort.
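To make the two CP queries concrete, the following is a minimal brute-force sketch for a 1-NN classifier with one-dimensional features. It enumerates every possible world induced by the incomplete cells and answers Q1 (is the test example CP'ed?) and Q2 (how many worlds support each label?). The function name and representation of possible worlds are hypothetical illustrations, and the enumeration is exponential in the number of incomplete rows; it is meant only to clarify the definitions, in contrast to the paper's linear- and polynomial-time algorithms.

```python
from itertools import product

def certain_prediction_1nn(worlds_per_row, labels, x):
    """Brute-force answers to the two CP queries for a 1-NN classifier.

    worlds_per_row: for each training row, the list of candidate feature
        values (a complete row has a single candidate); the cross product
        of these lists enumerates all possible worlds of the dataset.
    labels: the label of each training row (assumed clean).
    x: the test example (a scalar feature, for brevity).

    Returns (is_cp, support), where is_cp answers Q1 (every possible
    world yields the same 1-NN prediction for x) and support answers Q2
    (per-label count of possible worlds supporting that prediction).
    """
    support = {}
    for world in product(*worlds_per_row):
        # 1-NN prediction for x in this possible world.
        nearest = min(range(len(world)), key=lambda i: abs(world[i] - x))
        pred = labels[nearest]
        support[pred] = support.get(pred, 0) + 1
    return len(support) == 1, support

# Row 0's value is uncertain (1.0 or 9.0); row 1 is complete.
# Depending on the repair, x = 0.0 is closer to row 0 or to row 1,
# so x is not CP'ed: each label is supported by one possible world.
is_cp, support = certain_prediction_1nn([[1.0, 9.0], [4.0]], ["a", "b"], 0.0)
```

The point of the sketch is that naive enumeration visits exponentially many worlds; the paper's contribution is answering both queries for NN classifiers without that enumeration.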
Related papers
- Probably Approximately Precision and Recall Learning [60.00180898830079]
A key challenge in machine learning is the prevalence of one-sided feedback. We introduce a Probably Approximately Correct (PAC) framework in which hypotheses are set functions that map each input to a set of labels. We develop new algorithms that learn from positive data alone, achieving optimal sample complexity in the realizable case.
arXiv Detail & Related papers (2024-11-20T04:21:07Z) - Just Leaf It: Accelerating Diffusion Classifiers with Hierarchical Class Pruning [8.209660505275872]
We present a Hierarchical Diffusion Classifier (HDC) that exploits the inherent hierarchical label structure of a dataset.
HDC can accelerate inference by up to 60% while maintaining and, in some cases, improving classification accuracy.
Our work enables a new control mechanism of the trade-off between speed and precision, making diffusion-based classification more viable for real-world applications.
arXiv Detail & Related papers (2024-11-18T21:34:05Z) - Probabilistically robust conformal prediction [9.401004747930974]
Conformal prediction (CP) is a framework to quantify uncertainty of machine learning classifiers including deep neural networks.
Almost all the existing work on CP assumes clean testing data and there is not much known about the robustness of CP algorithms.
This paper studies the problem of probabilistically robust conformal prediction (PRCP) which ensures robustness to most perturbations.
arXiv Detail & Related papers (2023-07-31T01:32:06Z) - How Predictable Are Large Language Model Capabilities? A Case Study on
BIG-bench [52.11481619456093]
We study the performance prediction problem on experiment records from BIG-bench.
An $R^2$ score greater than 95% indicates the presence of learnable patterns within the experiment records.
We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3\times$ smaller.
arXiv Detail & Related papers (2023-05-24T09:35:34Z) - Interpretable by Design: Learning Predictors by Composing Interpretable
Queries [8.054701719767293]
We argue that machine learning algorithms should be interpretable by design.
We minimize the expected number of queries needed for accurate prediction.
Experiments on vision and NLP tasks demonstrate the efficacy of our approach.
arXiv Detail & Related papers (2022-07-03T02:40:34Z) - Certifiable Robustness for Nearest Neighbor Classifiers [6.487663563916903]
We study the complexity of certifying robustness for a simple but widely deployed classification algorithm, $k$-Nearest Neighbors ($k$-NN).
Our main focus is on inconsistent datasets when constraints are functional dependencies (FDs).
We exhibit a similar counting version of the problem, where the goal is to count the number of possible worlds that predict a certain label.
arXiv Detail & Related papers (2022-01-13T02:55:10Z) - Transformers Can Do Bayesian Inference [56.99390658880008]
We present Prior-Data Fitted Networks (PFNs).
PFNs leverage in-context learning in large-scale machine learning techniques to approximate a large set of posteriors.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems.
arXiv Detail & Related papers (2021-12-20T13:07:39Z) - CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z) - Revisiting Deep Local Descriptor for Improved Few-Shot Classification [56.74552164206737]
We show how one can improve the quality of embeddings by leveraging Dense Classification and Attentive Pooling.
We suggest to pool feature maps by applying attentive pooling instead of the widely used global average pooling (GAP) to prepare embeddings for few-shot classification.
arXiv Detail & Related papers (2021-03-30T00:48:28Z) - Online Active Model Selection for Pre-trained Classifiers [72.84853880948894]
We design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round.
Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams.
arXiv Detail & Related papers (2020-10-19T19:53:15Z) - Solving Long-tailed Recognition with Deep Realistic Taxonomic Classifier [68.38233199030908]
Long-tail recognition tackles the natural non-uniformly distributed data in real-world scenarios.
While modern classifiers perform well on populated classes, their performance degrades significantly on tail classes.
Deep-RTC is proposed as a new solution to the long-tail problem, combining realism with hierarchical predictions.
arXiv Detail & Related papers (2020-07-20T05:57:42Z) - Estimating g-Leakage via Machine Learning [34.102705643128004]
This paper considers the problem of estimating the information leakage of a system in the black-box scenario.
It is assumed that the system's internals are unknown to the learner, or anyway too complicated to analyze.
We propose a novel approach to perform black-box estimation of the g-vulnerability using Machine Learning (ML) algorithms.
arXiv Detail & Related papers (2020-05-09T09:26:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.