Biclustering and Boolean Matrix Factorization in Data Streams
- URL: http://arxiv.org/abs/2012.03138v1
- Date: Sat, 5 Dec 2020 23:02:43 GMT
- Title: Biclustering and Boolean Matrix Factorization in Data Streams
- Authors: Stefan Neumann, Pauli Miettinen
- Abstract summary: We study the clustering of bipartite graphs and Boolean matrix factorization in data streams.
We provide an algorithm that, after one pass over the stream, recovers the set of clusters on the right side of the graph using sublinear space.
We evaluate an implementation of the algorithm on synthetic data and on real-world data.
- Score: 12.005731086591139
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the clustering of bipartite graphs and Boolean matrix factorization
in data streams. We consider a streaming setting in which the vertices from the
left side of the graph arrive one by one together with all of their incident
edges. We provide an algorithm that, after one pass over the stream, recovers
the set of clusters on the right side of the graph using sublinear space; to
the best of our knowledge, this is the first algorithm with this property. We
also show that after a second pass over the stream, the left clusters of the
bipartite graph can be recovered and we show how to extend our algorithm to
solve the Boolean matrix factorization problem (by exploiting the
correspondence of Boolean matrices and bipartite graphs). We evaluate an
implementation of the algorithm on synthetic data and on real-world data. On
real-world datasets the algorithm is orders of magnitudes faster than a static
baseline algorithm while providing quality results within a factor 2 of the
baseline algorithm. Our algorithm scales linearly in the number of edges in the
graph. Finally, we analyze the algorithm theoretically and provide sufficient
conditions under which the algorithm recovers a set of planted clusters under a
standard random graph model.
Related papers
- A Differentially Private Clustering Algorithm for Well-Clustered Graphs [6.523602840064548]
We provide an efficient ($epsilon,$delta$)-DP algorithm tailored specifically for such graphs.
Our algorithm works for well-clustered graphs with $k$ nearly-balanced clusters.
arXiv Detail & Related papers (2024-03-21T11:57:16Z) - Fast and Simple Spectral Clustering in Theory and Practice [7.070726553564701]
Spectral clustering is a popular and effective algorithm designed to find $k$ clusters in a graph $G$.
We present a simple spectral clustering algorithm based on a vertices embedding with $O(log(k))$ computed by the power method.
We evaluate the new algorithm on several synthetic and real-world datasets, finding that it is significantly faster than alternative clustering algorithms, while producing results with approximately the same clustering accuracy.
arXiv Detail & Related papers (2023-10-17T02:31:57Z) - Latent Random Steps as Relaxations of Max-Cut, Min-Cut, and More [30.919536115917726]
We present a probabilistic model based on non-negative matrix factorization which unifies clustering and simplification.
By relaxing the hard clustering to a soft clustering, our algorithm relaxes potentially hard clustering problems to a tractable ones.
arXiv Detail & Related papers (2023-08-12T02:47:57Z) - Learning the Positions in CountSketch [49.57951567374372]
We consider sketching algorithms which first compress data by multiplication with a random sketch matrix, and then apply the sketch to quickly solve an optimization problem.
In this work, we propose the first learning-based algorithms that also optimize the locations of the non-zero entries.
arXiv Detail & Related papers (2023-06-11T07:28:35Z) - Efficient Graph Field Integrators Meet Point Clouds [59.27295475120132]
We present two new classes of algorithms for efficient field integration on graphs encoding point clouds.
The first class, SeparatorFactorization(SF), leverages the bounded genus of point cloud mesh graphs, while the second class, RFDiffusion(RFD), uses popular epsilon-nearest-neighbor graph representations for point clouds.
arXiv Detail & Related papers (2023-02-02T08:33:36Z) - Triangle and Four Cycle Counting with Predictions in Graph Streams [59.05440236993604]
We propose data-driven one-pass streaming algorithms for estimating the number of triangles and four cycles.
We use a trained oracle that can predict certain properties of the stream elements to improve on prior "classical" algorithms.
Our methodology expands upon prior work on "classical" streaming algorithms, as previous multi-pass and random order streaming algorithms can be seen as special cases.
arXiv Detail & Related papers (2022-03-17T19:26:00Z) - Skew-Symmetric Adjacency Matrices for Clustering Directed Graphs [5.301300942803395]
Cut-based directed graph (digraph) clustering often focuses on finding dense within-cluster or sparse between-cluster connections.
For flow-based clusterings the edges between clusters tend to be oriented in one direction and have been found in migration data, food webs, and trade data.
arXiv Detail & Related papers (2022-03-02T20:07:04Z) - Hierarchical Clustering: $O(1)$-Approximation for Well-Clustered Graphs [3.2901541059183432]
We present two-time approximation algorithms for hierarchical clustering.
The significance of our work is demonstrated by the empirical analysis on both synthetic and real-world data sets.
arXiv Detail & Related papers (2021-12-16T17:52:04Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for semi-definite programming relaxation (SDR) of a binary graph.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - Nearly Linear Row Sampling Algorithm for Quantile Regression [54.75919082407094]
We give a row sampling algorithm for the quantile loss function with sample complexity nearly linear in the dimensionality of the data.
Based upon our row sampling algorithm, we give the fastest known algorithm for quantile regression and a graph sparsification algorithm for balanced directed graphs.
arXiv Detail & Related papers (2020-06-15T13:40:07Z) - MC2G: An Efficient Algorithm for Matrix Completion with Social and Item
Similarity Graphs [85.89744949820376]
MC2G is an algorithm that performs matrix completion in the presence of social and item similarity graphs.
It is based on spectral clustering and local refinement steps.
We show via extensive experiments on both synthetic and real datasets that MC2G outperforms other state-of-the-art matrix completion algorithms.
arXiv Detail & Related papers (2020-06-08T06:11:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.