Can We Do Better Than Random Start? The Power of Data Outsourcing
- URL: http://arxiv.org/abs/2205.08098v1
- Date: Tue, 17 May 2022 05:34:36 GMT
- Title: Can We Do Better Than Random Start? The Power of Data Outsourcing
- Authors: Yi Chen, Jing Dong, Xin T. Tong
- Abstract summary: Many organizations have access to abundant data but lack the computational power to process the data.
We propose simulation-based algorithms that can utilize a small amount of outsourced data to find good initial points.
- Score: 9.677679780556103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many organizations have access to abundant data but lack the computational
power to process the data. While they can outsource the computational task to
other facilities, there are various constraints on the amount of data that can
be shared. It is natural to ask what data outsourcing can accomplish under such
constraints. We address this question from a machine learning perspective. When
training a model with optimization algorithms, the quality of the results often
relies heavily on the points where the algorithms are initialized. Random start
is one of the most popular methods to tackle this issue, but it can be
computationally expensive and not feasible for organizations lacking computing
resources. Based on three different scenarios, we propose simulation-based
algorithms that can utilize a small amount of outsourced data to find good
initial points accordingly. Under suitable regularity conditions, we provide
theoretical guarantees showing the algorithms can find good initial points with
high probability. We also conduct numerical experiments to demonstrate that our
algorithms perform significantly better than the random start approach.
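The general recipe is easy to illustrate. Below is a minimal sketch, with a toy non-convex objective and names of my own choosing: cheap candidate points are scored on the small shared subsample alone, and local optimization starts from the best-scoring candidate. This is only a schematic of the idea, not the paper's three scenario-specific algorithms.
```python
# Sketch: pick an initial point using a small "outsourced" subsample,
# then run a local optimizer from it, instead of a purely random start.
import numpy as np

rng = np.random.default_rng(0)

def loss(w, X, y):
    # Non-convex toy objective: squared error of a sine-feature model.
    return np.mean((np.sin(X @ w) - y) ** 2)

def grad(w, X, y, eps=1e-6):
    # Numerical gradient; adequate for a low-dimensional sketch.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * eps)
    return g

def descend(w, X, y, lr=0.1, steps=200):
    for _ in range(steps):
        w = w - lr * grad(w, X, y)
    return w

# Full data set (held by the data owner) and a small outsourced subsample.
X = rng.normal(size=(5000, 2)); w_true = np.array([2.0, -1.0])
y = np.sin(X @ w_true) + 0.05 * rng.normal(size=5000)
X_out, y_out = X[:100], y[:100]          # the shared 2% of the data

# Score many cheap candidates on the subsample; keep the best as the start.
candidates = rng.uniform(-4, 4, size=(50, 2))
w0 = min(candidates, key=lambda w: loss(w, X_out, y_out))

print("random start loss  :", loss(descend(candidates[0], X, y), X, y))
print("selected start loss:", loss(descend(w0, X, y), X, y))
```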
Related papers
- The Limits of Assumption-free Tests for Algorithm Performance [6.7171902258864655]
How well does an algorithm perform at a given modeling task, and which algorithm performs best?
We make a distinction between two questions: how good is an algorithm $A$ at the problem of learning from a training set of size $n$, versus, how good is a particular fitted model produced by running $A$ on a particular training data set of size $n$?
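A hedged sketch of that distinction, using synthetic data and ridge regression as the algorithm $A$: the conditional error of one fitted model is a single hold-out estimate, while the error of the algorithm at sample size $n$ averages that estimate over many independent training sets.
```python
# Two different quantities: error of ONE fitted model vs. average error
# of the ALGORITHM over training sets of size n.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5)); beta = rng.normal(size=5)
y = X @ beta + rng.normal(size=2000)
X_test, y_test = X[1500:], y[1500:]

def fitted_model_error(train_idx):
    m = Ridge().fit(X[train_idx], y[train_idx])
    return mean_squared_error(y_test, m.predict(X_test))

n = 200
# Question 1: hold-out error of one particular fitted model.
one_model = fitted_model_error(np.arange(n))
# Question 2: expected error of A at sample size n, approximated by
# averaging over independent training sets drawn from the pool.
algo_error = np.mean([fitted_model_error(rng.choice(1500, n, replace=False))
                      for _ in range(30)])
print(one_model, algo_error)
```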
arXiv Detail & Related papers (2024-02-12T03:19:30Z)
A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
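For intuition, here is an illustrative greedy selection rule, not the paper's 3-approximation algorithm: each next point is chosen by a weighted mix of its distance to the current subset (the k-center term) and a per-point uncertainty score.
```python
# Greedy farthest-point selection blended with uncertainty sampling.
import numpy as np

def weighted_kcenter(X, uncertainty, k, lam=0.5):
    chosen = [int(np.argmax(uncertainty))]          # seed with most uncertain
    d = np.linalg.norm(X - X[chosen[0]], axis=1)    # dist to nearest center
    while len(chosen) < k:
        score = lam * d + (1 - lam) * uncertainty   # combined objective
        score[chosen] = -np.inf                     # never re-pick a point
        nxt = int(np.argmax(score))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 8))
u = rng.uniform(size=500)          # e.g. per-example model uncertainty
print(weighted_kcenter(X, u, k=10))
```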
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
Leveraging Data Mining Algorithms to Recommend Source Code Changes [7.959841510571622]
This paper proposes an automatic method for recommending source code changes using four data mining algorithms.
We compare the algorithms in terms of performance (Precision, Recall and F-measure) and execution time.
Apriori seems appropriate for large-scale projects, whereas Eclat appears to be suitable for small-scale projects.
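The underlying idea can be sketched with frequent-pair mining over co-changed files; this is a toy Apriori-style pass, and the file names are hypothetical.
```python
# Mine file pairs that frequently change together, then recommend
# co-changed files when one of them is edited again.
from itertools import combinations
from collections import Counter

commits = [
    {"parser.c", "parser.h", "tests/parser_test.c"},
    {"parser.c", "parser.h"},
    {"lexer.c", "lexer.h"},
    {"parser.c", "tests/parser_test.c"},
]
min_support = 2

pair_counts = Counter(p for c in commits for p in combinations(sorted(c), 2))
frequent = {p for p, n in pair_counts.items() if n >= min_support}

def recommend(changed_file):
    # Files that frequently co-changed with the edited file.
    return {b if a == changed_file else a
            for a, b in frequent if changed_file in (a, b)}

print(recommend("parser.c"))
```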
arXiv Detail & Related papers (2023-04-29T18:38:23Z)
A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
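The pairwise-ordering task can be made concrete in a few lines; the similarity scores and expertise values below are invented for illustration.
```python
# How often does a similarity score mis-order a pair of papers relative
# to a reviewer's self-reported expertise?
from itertools import combinations

expertise  = {"paperA": 5, "paperB": 3, "paperC": 1}           # self-reported
similarity = {"paperA": 0.62, "paperB": 0.71, "paperC": 0.30}  # computed

pairs = [(p, q) for p, q in combinations(expertise, 2)
         if expertise[p] != expertise[q]]
errors = sum((expertise[p] > expertise[q]) != (similarity[p] > similarity[q])
             for p, q in pairs)
print(f"pairwise error rate: {errors / len(pairs):.0%}")
```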
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
Efficient and Accurate Learning of Mixtures of Plackett-Luce Models [5.216020588360421]
Mixture models of Plackett-Luce (PL) are an active research area of both theoretical and practical significance.
We propose an algorithm that can provide a provably accurate initial estimate and an EM algorithm that maximizes the true log-likelihood function efficiently.
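For reference, a single-component Plackett-Luce log-likelihood looks as follows; this is the building block only, not the paper's mixture model, initialization, or EM updates.
```python
# Plackett-Luce: a ranking is built by repeatedly picking the next item
# with probability proportional to its latent score.
import numpy as np

def pl_loglik(ranking, scores):
    # ranking: item indices from most to least preferred.
    ll = 0.0
    remaining = list(ranking)
    for item in ranking:
        denom = sum(scores[j] for j in remaining)
        ll += np.log(scores[item] / denom)
        remaining.remove(item)
    return ll

scores = np.array([3.0, 1.0, 0.5])        # latent item utilities
print(pl_loglik([0, 1, 2], scores))       # a likely ranking
print(pl_loglik([2, 1, 0], scores))       # an unlikely ranking
```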
arXiv Detail & Related papers (2023-02-10T16:00:40Z)
Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
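A bare-bones sketch of the setting, with UCB1 adapted to runtime minimization and runtimes censored at a timeout; this illustrates the problem, not the paper's adapted bandit algorithms.
```python
# Each arm is a solver; the loss is runtime capped at a timeout, so
# censored observations still yield bounded losses.
import math, random

random.seed(3)
TIMEOUT = 10.0
true_mean_runtime = [4.0, 7.0, 2.5]            # hypothetical solvers

counts = [0] * 3
mean_loss = [0.0] * 3

for t in range(1, 501):
    # Lower-confidence bound: lower (faster) is better, so subtract the bonus.
    lcb = [mean_loss[a] - math.sqrt(2 * math.log(t) / counts[a])
           if counts[a] else -math.inf for a in range(3)]
    arm = min(range(3), key=lambda a: lcb[a])
    runtime = min(random.expovariate(1 / true_mean_runtime[arm]), TIMEOUT)
    counts[arm] += 1
    mean_loss[arm] += (runtime - mean_loss[arm]) / counts[arm]

print(counts)   # the fastest solver should dominate
```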
arXiv Detail & Related papers (2021-09-13T18:10:52Z)
Towards Optimally Efficient Tree Search with Deep Learning [76.64632985696237]
This paper investigates the classical integer least-squares problem, which estimates integer signals from linear models.
The problem is NP-hard and often arises in diverse applications such as signal processing, bioinformatics, communications and machine learning.
We propose a general hyper-accelerated tree search (HATS) algorithm by employing a deep neural network to estimate the optimal heuristic for the underlying simplified memory-bounded A* algorithm.
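The role of the learned estimator is easiest to see in plain A*, where the heuristic is a pluggable function; in the paper a deep network plays that role inside a simplified memory-bounded A*. The sketch below is generic A* on a grid, with Manhattan distance standing in for the network.
```python
# Generic A* with a pluggable heuristic; a trained network would replace
# `manhattan` below.
import heapq

def a_star(start, goal, neighbors, heuristic):
    frontier = [(heuristic(start), 0, start)]
    best_g = {start: 0}
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if node == goal:
            return g
        for nxt, cost in neighbors(node):
            ng = g + cost
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + heuristic(nxt), ng, nxt))
    return None

# 4-connected 10x10 grid demo.
def neighbors(p):
    x, y = p
    return [((x+dx, y+dy), 1) for dx, dy in [(1,0),(-1,0),(0,1),(0,-1)]
            if 0 <= x+dx < 10 and 0 <= y+dy < 10]

manhattan = lambda p: abs(p[0] - 9) + abs(p[1] - 9)
print(a_star((0, 0), (9, 9), neighbors, manhattan))   # -> 18
```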
arXiv Detail & Related papers (2021-01-07T08:00:02Z)
Optimization for Supervised Machine Learning: Randomized Algorithms for Data and Parameters [10.279748604797911]
Key problems in machine learning and data science are routinely modeled as optimization problems and solved via optimization algorithms.
With the increase of the volume of data and the size and complexity of the statistical models used to formulate these often ill-conditioned optimization tasks, there is a need for new efficient algorithms able to cope with these challenges.
In this thesis, we deal with each of these sources of difficulty in a different way. To efficiently address the big data issue, we develop new methods which in each iteration examine a small random subset of the training data only.
To handle the big model issue, we develop methods which in each iteration update a small random subset of the model parameters only.
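Both randomization ideas fit in a short sketch for least squares: minibatch SGD touches a small random subset of the data per step, while randomized coordinate descent updates a small random subset of the parameters per step. This is a toy problem with step sizes of my own choosing, not the thesis's methods.
```python
# Randomize over DATA (minibatch SGD) vs. over PARAMETERS (coordinate descent).
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(10000, 50)); w_star = rng.normal(size=50)
y = X @ w_star + 0.01 * rng.normal(size=10000)

def sgd(w, lr=0.01, batch=32, steps=2000):
    for _ in range(steps):
        idx = rng.choice(len(X), batch, replace=False)   # random data subset
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w = w - lr * g
    return w

def rcd(w, lr=0.5, block=5, steps=2000):
    for _ in range(steps):
        j = rng.choice(50, block, replace=False)         # random coord subset
        g = X[:, j].T @ (X @ w - y) / len(X)
        w[j] = w[j] - lr * g
    return w

print(np.linalg.norm(sgd(np.zeros(50)) - w_star))
print(np.linalg.norm(rcd(np.zeros(50)) - w_star))
```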
arXiv Detail & Related papers (2020-08-26T21:15:18Z)
Run2Survive: A Decision-theoretic Approach to Algorithm Selection based on Survival Analysis [75.64261155172856]
Survival analysis (SA) naturally supports censored data and offers appropriate ways to use such data for learning distributional models of algorithm runtime.
We leverage such models as a basis of a sophisticated decision-theoretic approach to algorithm selection, which we dub Run2Survive.
In an extensive experimental study with the standard benchmark ASlib, our approach is shown to be highly competitive and in many cases even superior to state-of-the-art AS approaches.
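As a minimal illustration of why censoring-aware tools matter, a Kaplan-Meier estimate treats timeouts as right-censored runs rather than discarding them; Run2Survive itself fits distributional runtime models per algorithm, so this is only the textbook estimator.
```python
# Kaplan-Meier survival estimate over censored solver runtimes.
# (runtime, finished?) pairs; runs hitting the 10s timeout are censored.
runs = [(1.2, True), (3.5, True), (10.0, False), (2.1, True),
        (10.0, False), (5.7, True), (0.9, True), (10.0, False)]

times = sorted({t for t, done in runs if done})
at_risk = lambda t: sum(rt >= t for rt, _ in runs)       # still running at t
deaths  = lambda t: sum(rt == t and done for rt, done in runs)

surv = 1.0
for t in times:
    surv *= 1 - deaths(t) / at_risk(t)
    print(f"P(runtime > {t:4.1f}s) = {surv:.3f}")
```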
arXiv Detail & Related papers (2020-07-06T15:20:17Z)
FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity to Non-IID Data [59.50904660420082]
Federated Learning (FL) has become a popular paradigm for learning from distributed data.
To effectively utilize data at different devices without moving them to the cloud, algorithms such as the Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model.
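The CTA pattern is a few lines in the simplest case, sketched below for least squares with equal-weight averaging; this is bare-bones FedAvg, not FedPD.
```python
# "Computation then aggregation": local gradient steps on each device,
# then the server averages the resulting models.
import numpy as np

rng = np.random.default_rng(5)
w_true = np.array([1.0, -2.0])
# Non-IID device data: each device sees a different input distribution.
Xs = [rng.normal(loc=m, size=(200, 2)) for m in (-2, 0, 2)]
devices = [(X, X @ w_true + 0.1 * rng.normal(size=200)) for X in Xs]

def local_update(w, X, y, lr=0.01, steps=20):
    for _ in range(steps):                    # computation on-device
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

w = np.zeros(2)
for round_ in range(50):
    locals_ = [local_update(w.copy(), X, y) for X, y in devices]
    w = np.mean(locals_, axis=0)              # aggregation at the server
print(w)   # should approach w_true
```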
arXiv Detail & Related papers (2020-05-22T23:07:42Z)
How to Solve Fair $k$-Center in Massive Data Models [5.3283669037198615]
We design new streaming and distributed algorithms for the fair $k$-center problem.
Our main contributions are: (a) the first distributed algorithm; and (b) a two-pass streaming algorithm with a provable approximation guarantee.
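For flavor, an offline greedy variant that enforces a per-group quota of centers is sketched below; it is illustrative only and is neither the paper's distributed algorithm nor its two-pass streaming algorithm.
```python
# Farthest-point k-center that respects per-group quotas, so each
# demographic group contributes a fixed number of centers.
import numpy as np

def fair_kcenter(X, groups, quota):
    left = dict(quota)                       # remaining picks per group
    centers = [0]; left[groups[0]] -= 1
    d = np.linalg.norm(X - X[0], axis=1)     # dist to nearest chosen center
    while sum(left.values()) > 0:
        order = np.argsort(-d)               # farthest points first
        nxt = next(i for i in order if left[groups[i]] > 0)
        centers.append(int(nxt)); left[groups[nxt]] -= 1
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return centers

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))
groups = rng.integers(0, 2, size=300)        # two demographic groups
print(fair_kcenter(X, groups, quota={0: 3, 1: 3}))
```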
arXiv Detail & Related papers (2020-02-18T16:11:40Z)