Improving Inference Performance of Machine Learning with the
Divide-and-Conquer Principle
- URL: http://arxiv.org/abs/2301.05099v1
- Date: Thu, 12 Jan 2023 15:55:12 GMT
- Title: Improving Inference Performance of Machine Learning with the
Divide-and-Conquer Principle
- Authors: Alex Kogan
- Abstract summary: Many popular machine learning models scale poorly when deployed on CPUs.
We propose a simple, yet effective approach based on the Divide-and-Conquer Principle to tackle this problem.
We implement this idea in the popular OnnxRuntime framework and evaluate its effectiveness with several use cases.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Many popular machine learning models scale poorly when deployed on CPUs. In
this paper we explore the reasons why and propose a simple, yet effective
approach based on the well-known Divide-and-Conquer Principle to tackle this
problem of great practical importance. Given an inference job, instead of using
all available computing resources (i.e., CPU cores) for running it, the idea is
to break the job into independent parts that can be executed in parallel, each
with the number of cores according to its expected computational cost. We
implement this idea in the popular OnnxRuntime framework and evaluate its
effectiveness with several use cases, including the well-known models for
optical character recognition (PaddleOCR) and natural language processing
(BERT).
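The allocation idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how costs are estimated or how cores are rounded, so the proportional largest-remainder split below is an assumption. The names `allocate_cores`, `run_parts`, and the `run_part` callback are hypothetical. Here `run_part` stands in for whatever runs inference on one part with a given thread count.

```python
from concurrent.futures import ThreadPoolExecutor


def allocate_cores(costs, total_cores):
    """Split total_cores among parts in proportion to their expected cost.

    Hypothetical helper: every part gets at least one core, and cores left
    over after rounding down go to the largest fractional remainders.
    """
    if not costs or len(costs) > total_cores:
        raise ValueError("need between 1 and total_cores parts")
    if any(c <= 0 for c in costs):
        raise ValueError("costs must be positive")
    spare = total_cores - len(costs)  # cores left after one per part
    total_cost = sum(costs)
    shares = [spare * c / total_cost for c in costs]
    alloc = [1 + int(s) for s in shares]
    leftover = total_cores - sum(alloc)
    # Hand out the remaining cores by largest fractional remainder.
    order = sorted(
        range(len(costs)),
        key=lambda i: shares[i] - int(shares[i]),
        reverse=True,
    )
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


def run_parts(parts, costs, total_cores, run_part):
    """Run each independent part concurrently with its own core budget.

    run_part(part, num_cores) is an assumed callback supplied by the caller.
    Results are returned in the same order as parts.
    """
    alloc = allocate_cores(costs, total_cores)
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        futures = [pool.submit(run_part, p, n) for p, n in zip(parts, alloc)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Three parts, the last one expected to cost twice as much as the others.
    print(allocate_cores([1, 1, 2], 8))
    print(run_parts(["a", "b"], [3, 1], 4, lambda part, n: (part, n)))
```

Reserving one core per part keeps a cheap part from being starved. The sketch only computes the budgets and passes them to the callback; it does not pin threads or limit a runtime's thread pool itself.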
Related papers
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference.
This paper presents the first comprehensive study on the prevalent issue of overthinking in these models.
We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
- Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models [42.124670377223175]
We propose a novel framework for inference acceleration called the Pruning All-Rounder (PAR).
Using self-supervised learning, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios.
arXiv Detail & Related papers (2024-12-09T13:02:35Z)
- Scalable Federated Unlearning via Isolated and Coded Sharding [76.12847512410767]
Federated unlearning has emerged as a promising paradigm to erase the client-level data effect.
This paper proposes a scalable federated unlearning framework based on isolated sharding and coded computing.
arXiv Detail & Related papers (2024-01-29T08:41:45Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Efficient Prompting via Dynamic In-Context Learning [76.83516913735072]
We propose DynaICL, a recipe for efficient prompting with black-box generalist models.
DynaICL dynamically allocates in-context examples according to the input complexity and the computational budget.
We find that DynaICL saves up to 46% token budget compared to the common practice that allocates the same number of in-context examples to each input.
arXiv Detail & Related papers (2023-05-18T17:58:31Z)
- Matched Machine Learning: A Generalized Framework for Treatment Effect
Inference With Learned Metrics [87.05961347040237]
We introduce Matched Machine Learning, a framework that combines the flexibility of machine learning black boxes with the interpretability of matching.
Our framework uses machine learning to learn an optimal metric for matching units and estimating outcomes.
We show empirically that instances of Matched Machine Learning perform on par with black-box machine learning methods and better than existing matching methods for similar problems.
arXiv Detail & Related papers (2023-04-03T19:32:30Z)
- Efficient Sub-structured Knowledge Distillation [52.5931565465661]
We propose an approach that is much simpler in its formulation and far more efficient for training than existing approaches.
We transfer the knowledge from a teacher model to its student model by locally matching their predictions on all sub-structures, instead of the whole output space.
arXiv Detail & Related papers (2022-03-09T15:56:49Z)
- Efficient Inference via Universal LSH Kernel [35.22983601434134]
We propose mathematically provable Representer Sketch, a concise set of count arrays that can approximate the inference procedure with simple hashing computations and aggregations.
Representer Sketch builds upon the popular Representer Theorem from kernel literature, hence the name.
We show that Representer Sketch achieves up to 114x reduction in storage requirement and 59x reduction in complexity without any drop in accuracy.
arXiv Detail & Related papers (2021-06-21T22:06:32Z)
- Fast Object Segmentation Learning with Kernel-based Methods for Robotics [21.48920421574167]
Object segmentation is a key component in the visual system of a robot that performs tasks like grasping and object manipulation.
We propose a novel architecture for object segmentation that overcomes this problem and provides comparable performance in a fraction of the time required by state-of-the-art methods.
Our approach is validated on the YCB-Video dataset, which is widely adopted in the computer vision and robotics community.
arXiv Detail & Related papers (2020-11-25T15:07:39Z)
- Optimizing Streaming Parallelism on Heterogeneous Many-Core
Architectures: A Machine Learning Based Approach [16.702537371391053]
This article presents an automatic approach to derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures.
Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration.
Compared to the single-stream version, our approach achieves a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively.
arXiv Detail & Related papers (2020-03-05T21:18:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.