Mostly Beneficial Clustering: Aggregating Data for Operational Decision Making
- URL: http://arxiv.org/abs/2311.17326v2
- Date: Sun, 17 Dec 2023 09:04:47 GMT
- Title: Mostly Beneficial Clustering: Aggregating Data for Operational Decision Making
- Authors: Chengzhang Li, Zhenkang Peng, and Ying Rong
- Abstract summary: We propose a cluster-based Shrunken-SAA approach that can exploit the cluster structure among problems.
We prove that, as the number of problems grows, leveraging the given cluster structure among problems yields additional benefits.
Our proposed approach can be extended to general cost functions under mild conditions.
- Score: 3.9825334703672812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With increasingly volatile market conditions and rapid product innovations,
operational decision-making for large-scale systems entails solving thousands
of problems with limited data. Data aggregation has been proposed to combine
data across problems and improve on the decisions obtained by solving those
problems individually. We propose a novel cluster-based Shrunken-SAA approach that can
exploit the cluster structure among problems when implementing the data
aggregation approaches. We prove that, as the number of problems grows,
leveraging the given cluster structure among problems yields additional
benefits over the data aggregation approaches that neglect such structure. When
the cluster structure is unknown, we show that unveiling the cluster structure,
even at the cost of a few data points, can be beneficial, especially when the
distance between clusters of problems is substantial. Our proposed approach can
be extended to general cost functions under mild conditions. When the number of
problems gets large, the optimality gap of our proposed approach decreases
exponentially in the distance between the clusters. We explore the performance
of the proposed approach through numerical experiments on managing newsvendor
systems. We investigate the impact of distance metrics
between problem instances on the performance of the cluster-based Shrunken-SAA
approach with synthetic data. We further validate our proposed approach with
real data and highlight the advantages of cluster-based data aggregation,
especially in the small-data large-scale regime, compared to the existing
approaches.
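To make the mechanics concrete, below is a minimal Python sketch of cluster-based shrinkage for newsvendor problems, under stated assumptions: each problem's empirical demand distribution is blended with the pooled sample of its own cluster, and the order quantity is the critical-ratio quantile of the blended distribution. The shrinkage weight `alpha` is taken as given here, whereas the paper's approach would choose it from data; the function names and the fixed critical ratio are illustrative, not the authors' implementation.

```python
import numpy as np

def shrunken_quantile(own_demand, anchor_demand, alpha, critical_ratio):
    """Order quantity from a shrunken empirical distribution: the problem's own
    sample keeps total weight N_k, the cluster anchor sample gets total weight
    alpha, and the critical-ratio quantile of the blend is returned."""
    n_own = len(own_demand)
    values = np.concatenate([own_demand, anchor_demand])
    weights = np.concatenate([
        np.full(n_own, 1.0 / (n_own + alpha)),
        np.full(len(anchor_demand), alpha / len(anchor_demand) / (n_own + alpha)),
    ])
    order = np.argsort(values)
    cdf = np.cumsum(weights[order])
    return values[order][np.searchsorted(cdf, critical_ratio)]

def cluster_shrunken_saa(demands, cluster_of, alpha, critical_ratio=0.8):
    """demands: one 1-D demand-sample array per newsvendor problem.
    cluster_of: a cluster label per problem (given or estimated).
    Each problem's empirical distribution is shrunk toward the pooled sample
    of its cluster before taking the newsvendor quantile."""
    pooled = {c: np.concatenate([d for d, lab in zip(demands, cluster_of) if lab == c])
              for c in set(cluster_of)}
    return [shrunken_quantile(d, pooled[lab], alpha, critical_ratio)
            for d, lab in zip(demands, cluster_of)]
```

Setting alpha = 0 recovers the per-problem SAA solution, while a very large alpha effectively pools all data within a cluster; the paper studies the trade-off between these extremes and what is gained when clusters are well separated.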
Related papers
- A3S: A General Active Clustering Method with Pairwise Constraints [66.74627463101837]
A3S features strategic active clustering adjustment on the initial cluster result, which is obtained by an adaptive clustering algorithm.
In extensive experiments across diverse real-world datasets, A3S achieves desired results with significantly fewer human queries.
arXiv Detail & Related papers (2024-07-14T13:37:03Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
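As a rough illustration of the mixing idea described in this entry (not the authors' exact procedure), synthetic minority samples can be formed as convex combinations of minority and majority points; the function below and its parameters are hypothetical.

```python
import numpy as np

def mix_minority_majority(X_min, X_maj, n_new, lam_low=0.5, rng=None):
    """Generate synthetic minority-class samples by convexly mixing a random
    minority point with a random majority point. Drawing the mixing weight
    above lam_low keeps each synthetic point closer to the minority class."""
    rng = np.random.default_rng(rng)
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_maj), size=n_new)
    lam = rng.uniform(lam_low, 1.0, size=n_new)[:, None]
    return lam * X_min[i] + (1.0 - lam) * X_maj[j]
```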
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Research on Efficient Fuzzy Clustering Method Based on Local Fuzzy
Granular balls [67.33923111887933]
In this paper, the data are fuzzified iteratively using granular-balls, and the membership degree of each data point considers only the two granular-balls in which it is located.
The resulting set of fuzzy granular-balls admits a wider range of processing methods across different data scenarios.
arXiv Detail & Related papers (2023-03-07T01:52:55Z) - Neural Capacitated Clustering [6.155158115218501]
We propose a new method for the Capacitated Clustering Problem (CCP) that learns a neural network to predict the assignment probabilities of points to cluster centers.
In our experiments on artificial data and two real-world datasets, our approach outperforms several state-of-the-art mathematical and heuristic solvers from the literature.
arXiv Detail & Related papers (2023-02-10T09:33:44Z) - Differentially Private Federated Clustering over Non-IID Data [59.611244450530315]
The federated clustering (FedC) problem aims to accurately partition unlabeled data samples distributed over massive clients into a finite number of clusters under the orchestration of a server.
We propose a novel FedC algorithm using a differential privacy technique, referred to as DP-Fed, in which partial client participation and multiple local model updates are also considered.
Various attributes of the proposed DP-Fed are obtained through theoretical analyses of privacy protection, especially for the case of non-independent and identically distributed (non-i.i.d.) data.
arXiv Detail & Related papers (2023-01-03T05:38:43Z) - A One-shot Framework for Distributed Clustered Learning in Heterogeneous
Environments [54.172993875654015]
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
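A minimal sketch of the one-shot pattern described above, assuming each user contributes a single local estimate (here, a sample mean) and the server clusters those estimates with k-means; the exact local statistic and aggregation rule in the paper may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def one_shot_clustered_aggregation(user_data, n_clusters):
    """user_data: one (n_i, d) array per user.
    Each user sends a single local estimate (its sample mean) in one round of
    communication; the server clusters the estimates and returns one
    aggregated model (the within-cluster average) per cluster."""
    local = np.stack([x.mean(axis=0) for x in user_data])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(local)
    return {c: local[labels == c].mean(axis=0) for c in range(n_clusters)}
```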
arXiv Detail & Related papers (2022-09-22T09:04:10Z) - How to Use K-means for Big Data Clustering? [2.1165011830664677]
K-means is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model.
We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering.
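One generic way such a scheme can look, offered only as a hedged sketch (a chunk-and-recluster layout, not necessarily the parallel scheme proposed in the paper):

```python
import numpy as np
from multiprocessing import Pool
from sklearn.cluster import KMeans

def _fit_chunk(args):
    chunk, k = args
    return KMeans(n_clusters=k, init="k-means++", n_init=5).fit(chunk).cluster_centers_

def parallel_kmeans(X, k, n_chunks=8, processes=4):
    """Chunk-and-recluster sketch: run K-means++ on data chunks in parallel,
    then cluster the collected chunk centroids to obtain k global centers."""
    chunks = np.array_split(X, n_chunks)
    with Pool(processes) as pool:
        centers = np.vstack(pool.map(_fit_chunk, [(c, k) for c in chunks]))
    return KMeans(n_clusters=k, init="k-means++", n_init=10).fit(centers).cluster_centers_
```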
arXiv Detail & Related papers (2022-04-14T08:18:01Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that offer utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - Fast and Interpretable Consensus Clustering via Minipatch Learning [0.0]
We develop IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering.
We develop adaptive sampling schemes for observations, which result in both improved reliability and computational savings.
Results show that our approach yields more accurate and interpretable cluster solutions.
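For intuition, a plain (non-adaptive) minipatch consensus clustering sketch follows; IMPACC's adaptive sampling of observations is replaced here by uniform sampling, so this illustrates the general idea rather than the authors' algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def minipatch_consensus(X, k, n_patches=100, obs_frac=0.2, feat_frac=0.5, rng=0):
    """Consensus clustering from many tiny 'minipatches': each patch samples a
    subset of observations and features, clusters it, and votes into a
    co-clustering matrix; final labels come from hierarchical clustering of
    the consensus matrix. (Uniform sampling here; IMPACC samples adaptively.)"""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    together = np.zeros((n, n))
    counted = np.zeros((n, n))
    for _ in range(n_patches):
        obs = rng.choice(n, size=max(k + 1, int(obs_frac * n)), replace=False)
        feats = rng.choice(p, size=max(1, int(feat_frac * p)), replace=False)
        labels = KMeans(n_clusters=k, n_init=5).fit_predict(X[np.ix_(obs, feats)])
        counted[np.ix_(obs, obs)] += 1
        together[np.ix_(obs, obs)] += (labels[:, None] == labels[None, :]).astype(float)
    consensus = np.divide(together, counted, out=np.zeros_like(together), where=counted > 0)
    dist = 1.0 - consensus
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")
```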
arXiv Detail & Related papers (2021-10-05T22:39:28Z) - ThetA -- fast and robust clustering via a distance parameter [3.0020405188885815]
Clustering is a fundamental problem in machine learning where distance-based approaches have dominated the field for many decades.
We propose a new set of distance threshold methods called Theta-based Algorithms (ThetA).
arXiv Detail & Related papers (2021-02-13T23:16:33Z) - reval: a Python package to determine best clustering solutions with
stability-based relative clustering validation [1.8129328638036126]
reval is a Python package that leverages stability-based relative clustering validation methods to determine best clustering solutions.
This work aims at developing a stability-based method that selects the best clustering solution as the one that replicates, via supervised learning, on unseen subsets of data.
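The replication criterion can be sketched as follows, as an illustrative approximation rather than the reval package API: cluster one half of the data, train a classifier on those cluster labels, and measure how well its predictions on the held-out half agree with clustering that half directly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from scipy.optimize import linear_sum_assignment

def stability_score(X, k, n_splits=10, rng=0):
    """For each split: cluster the train half, teach a classifier those labels,
    then check how well its predictions on the test half agree with clustering
    the test half directly (labels matched via the Hungarian method).
    Higher average agreement means a more reproducible clustering for this k."""
    scores = []
    for seed in range(n_splits):
        X_tr, X_te = train_test_split(X, test_size=0.5, random_state=rng + seed)
        lab_tr = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_tr)
        pred_te = KNeighborsClassifier().fit(X_tr, lab_tr).predict(X_te)
        lab_te = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_te)
        cont = np.zeros((k, k))                    # contingency table
        for p, l in zip(pred_te, lab_te):
            cont[p, l] += 1
        row, col = linear_sum_assignment(-cont)    # best label matching
        scores.append(cont[row, col].sum() / len(X_te))
    return np.mean(scores)
```

Picking the number of clusters that maximizes this agreement is one way to operationalize "the solution that replicates on unseen subsets of data".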
arXiv Detail & Related papers (2020-08-27T10:36:56Z)