Encoding of data sets and algorithms
- URL: http://arxiv.org/abs/2303.00984v1
- Date: Thu, 2 Mar 2023 05:29:27 GMT
- Title: Encoding of data sets and algorithms
- Authors: Katarina Doctor, Tong Mao, Hrushikesh Mhaskar
- Abstract summary: In many high-impact applications, it is important to ensure the quality of output of a machine learning algorithm.
We have initiated a mathematically rigorous theory to decide which models are close to each other in terms of certain metrics.
A given threshold metric acting on this grid will express the nearness (or statistical distance) from each algorithm and data set of interest to any given application.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many high-impact applications, it is important to ensure the quality of
output of a machine learning algorithm as well as its reliability in comparison
with the complexity of the algorithm used. In this paper, we have initiated a
mathematically rigorous theory to decide which models (algorithms applied on
data sets) are close to each other in terms of certain metrics, such as
performance and the complexity level of the algorithm. This involves creating a
grid on the hypothetical spaces of data sets and algorithms so as to identify a
finite set of probability distributions from which the data sets are sampled
and a finite set of algorithms. A given threshold metric acting on this grid
will express the nearness (or statistical distance) from each algorithm and
data set of interest to any given application. A technically difficult part of
this project is to estimate the so-called metric entropy of a compact subset of
functions of \textbf{infinitely many variables} that arise in the definition of
these spaces.
Related papers
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - Multi-Dimensional Ability Diagnosis for Machine Learning Algorithms [88.93372675846123]
We propose a task-agnostic evaluation framework Camilla for evaluating machine learning algorithms.
We use cognitive diagnosis assumptions and neural networks to learn the complex interactions among algorithms, samples and the skills of each sample.
In our experiments, Camilla outperforms state-of-the-art baselines on the metric reliability, rank consistency and rank stability.
arXiv Detail & Related papers (2023-07-14T03:15:56Z) - Linearized Wasserstein dimensionality reduction with approximation
guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z) - Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z) - Accelerating ERM for data-driven algorithm design using output-sensitive techniques [26.32088674030797]
We study techniques to develop efficient learning algorithms for data-driven algorithm design.
Our approach involves two novel ingredients -- an output-sensitive algorithm for enumerating polytopes induced by a set of hyperplanes.
We illustrate our techniques by giving algorithms for pricing problems, linkage-based clustering and dynamic-programming based sequence alignment.
arXiv Detail & Related papers (2022-04-07T17:27:18Z) - Pre-Clustering Point Clouds of Crop Fields Using Scalable Methods [14.06711982797654]
We show a similarity between the current state-of-the-art for this problem and a commonly used density-based clustering algorithm, Quickshift.
We propose a number of novel, application specific algorithms with the goal of producing a general and scalable plant segmentation algorithm.
When incorporated into field-scale phenotyping systems, the proposed algorithms should work as a drop in replacement that can greatly improve the accuracy of results.
arXiv Detail & Related papers (2021-07-22T22:47:22Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Fuzzy clustering algorithms with distance metric learning and entropy
regularization [0.0]
This paper proposes fuzzy clustering algorithms based on Euclidean, City-block and Mahalanobis distances and entropy regularization.
Several experiments on synthetic and real datasets, including its application to noisy image texture segmentation, demonstrate the usefulness of these adaptive clustering methods.
arXiv Detail & Related papers (2021-02-18T18:19:04Z) - Towards Optimally Efficient Tree Search with Deep Learning [76.64632985696237]
This paper investigates the classical integer least-squares problem which estimates signals integer from linear models.
The problem is NP-hard and often arises in diverse applications such as signal processing, bioinformatics, communications and machine learning.
We propose a general hyper-accelerated tree search (HATS) algorithm by employing a deep neural network to estimate the optimal estimation for the underlying simplified memory-bounded A* algorithm.
arXiv Detail & Related papers (2021-01-07T08:00:02Z) - Data-driven Algorithm Design [21.39493074700162]
Data driven algorithm design is an important aspect of modern data science and algorithm design.
A small tweak to the parameters can cause a cascade of changes in the algorithm's behavior.
We provide strong computational and statistical performance guarantees for batch and online scenarios.
arXiv Detail & Related papers (2020-11-14T00:51:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.