How to Query An Oracle? Efficient Strategies to Label Data
- URL: http://arxiv.org/abs/2110.02341v1
- Date: Tue, 5 Oct 2021 20:15:35 GMT
- Title: How to Query An Oracle? Efficient Strategies to Label Data
- Authors: Farshad Lahouti, Victoria Kostina, Babak Hassibi
- Abstract summary: We consider the basic problem of querying an expert oracle for labeling a dataset in machine learning.
We present a randomized batch algorithm that operates on a round-by-round basis to label the samples and achieves a query rate of $O(fracNk2)$.
In addition, we present an adaptive greedy query scheme, which achieves an average rate of $approx 0.2N$ queries per sample with triplet queries.
- Score: 59.89900843097016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the basic problem of querying an expert oracle for labeling a
dataset in machine learning. This is typically an expensive and time consuming
process and therefore, we seek ways to do so efficiently. The conventional
approach involves comparing each sample with (the representative of) each class
to find a match. In a setting with $N$ equally likely classes, this involves
$N/2$ pairwise comparisons (queries per sample) on average. We consider a
$k$-ary query scheme with $k\ge 2$ samples in a query that identifies
(dis)similar items in the set while effectively exploiting the associated
transitive relations. We present a randomized batch algorithm that operates on
a round-by-round basis to label the samples and achieves a query rate of
$O(\frac{N}{k^2})$. In addition, we present an adaptive greedy query scheme,
which achieves an average rate of $\approx 0.2N$ queries per sample with
triplet queries. For the proposed algorithms, we investigate the query rate
performance analytically and with simulations. Empirical studies suggest that
each triplet query takes an expert at most 50\% more time compared with a
pairwise query, indicating the effectiveness of the proposed $k$-ary query
schemes. We generalize the analyses to nonuniform class distributions when
possible.
Related papers
- Sample-Efficiency in Multi-Batch Reinforcement Learning: The Need for Dimension-Dependent Adaptivity [16.331196225467707]
We theoretically explore the relationship between sample-efficiency and adaptivity in reinforcement learning.
We employ a learning framework that allows sending queries in $K$ batches, with feedback being processed and queries updated after each batch.
arXiv Detail & Related papers (2023-10-02T20:14:01Z) - JoinGym: An Efficient Query Optimization Environment for Reinforcement
Learning [58.71541261221863]
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost.
We present JoinGym, a query optimization environment for bushy reinforcement learning (RL)
Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset.
arXiv Detail & Related papers (2023-07-21T17:00:06Z) - Fast Interactive Search with a Scale-Free Comparison Oracle [17.38671584773247]
A comparison-based search algorithm lets a user find a target item $t$ in a database by answering queries of the form.
We propose a scale-free probabilistic oracle model called $gamma$-CKL for such similarity triplets $(i,j;t)$.
We develop a search algorithm with provably exponential rate of convergence under the $gamma$-CKL oracle, thanks to a backtracking strategy.
arXiv Detail & Related papers (2023-06-02T09:33:19Z) - On the Sample Complexity of Representation Learning in Multi-task
Bandits with Global and Local structure [77.60508571062958]
We investigate the sample complexity of learning the optimal arm for multi-task bandit problems.
Arms consist of two components: one that is shared across tasks (that we call representation) and one that is task-specific (that we call predictor)
We devise an algorithm OSRL-SC whose sample complexity approaches the lower bound, and scales at most as $H(Glog(delta_G)+ Xlog(delta_H))$, with $X,G,H$ being, respectively, the number of tasks, representations and predictors.
arXiv Detail & Related papers (2022-11-28T08:40:12Z) - Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with
a Faulty Oracle [7.449644976563424]
We propose an elegant theoretical model for studying clustering with a faulty oracle.
It was left as an open question whether one can obtain a query-optimal, time-efficient algorithm for the general case of $k$ clusters.
We provide a time-efficient algorithm with nearly-optimal query complexity (up to a factor of $O(log2 n)$) for all constant $k$ and any $delta$ in the regime when information-theoretic recovery is possible.
arXiv Detail & Related papers (2021-06-18T22:20:12Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $widetildeO(ndk)$ for all $k = O(sqrtd) cup Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $widetildeO(ndk)$ for all $k$, at the expense of an $O(sqrtlog k)$ factor in the recovery guarantee
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - A Provably Efficient Sample Collection Strategy for Reinforcement
Learning [123.69175280309226]
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior.
We propose to tackle the exploration-exploitation problem following a decoupled approach composed of: 1) An "objective-specific" algorithm that prescribes how many samples to collect at which states, as if it has access to a generative model (i.e., sparse simulator of the environment); 2) An "objective-agnostic" sample collection responsible for generating the prescribed samples as fast as possible.
arXiv Detail & Related papers (2020-07-13T15:17:35Z) - Query complexity of heavy hitter estimation [6.373263986460191]
We consider the problem of identifying the subset $mathcalSgamma_mathcalP$ of elements in the support of an underlying distribution $mathcalP$.
We consider two query models: $(a)$ each query is an index $i$ and the oracle return the value $X_i$ and $(b)$ each query is a pair $(i,j)$.
For each of these query models, we design sequential estimation algorithms which at each round, either decide what query to send to the oracle depending on the entire
arXiv Detail & Related papers (2020-05-29T07:15:46Z) - Query-Efficient Correlation Clustering [13.085439249887713]
Correlation clustering is arguably the most natural formulation of clustering.
A main drawback of correlation clustering is that it requires as input the $Theta(n2)$ pairwise similarities.
We devise a correlation clustering algorithm that attains a solution whose expected number of disagreements is at most $3cdot OPT + O(fracn3Q)$.
arXiv Detail & Related papers (2020-02-26T15:18:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.