HyperAid: Denoising in hyperbolic spaces for tree-fitting and
hierarchical clustering
- URL: http://arxiv.org/abs/2205.09721v1
- Date: Thu, 19 May 2022 17:33:16 GMT
- Title: HyperAid: Denoising in hyperbolic spaces for tree-fitting and
hierarchical clustering
- Authors: Eli Chien, Puoya Tabaghi, Olgica Milenkovic
- Abstract summary: We propose a new approach to treemetric denoising (HyperAid) in hyperbolic spaces.
It transforms original data into data that is more'' tree-like, when evaluated in terms of Gromov's $delta$ hyperbolicity.
We integrate HyperAid with schemes for enforcing nonnegative edge-weights.
- Score: 36.738414547278154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problem of fitting distances by tree-metrics has received significant
attention in the theoretical computer science and machine learning communities
alike, due to many applications in natural language processing, phylogeny,
cancer genomics and a myriad of problem areas that involve hierarchical
clustering. Despite the existence of several provably exact algorithms for
tree-metric fitting of data that inherently obeys tree-metric constraints, much
less is known about how to best fit tree-metrics for data whose structure
moderately (or substantially) differs from a tree. For such noisy data, most
available algorithms perform poorly and often produce negative edge weights in
representative trees. Furthermore, it is currently not known how to choose the
most suitable approximation objective for noisy fitting. Our contributions are
as follows. First, we propose a new approach to tree-metric denoising
(HyperAid) in hyperbolic spaces which transforms the original data into data
that is ``more'' tree-like, when evaluated in terms of Gromov's $\delta$
hyperbolicity. Second, we perform an ablation study involving two choices for
the approximation objective, $\ell_p$ norms and the Dasgupta loss. Third, we
integrate HyperAid with schemes for enforcing nonnegative edge-weights. As a
result, the HyperAid platform outperforms all other existing methods in the
literature, including Neighbor Joining (NJ), TreeRep and T-REX, both on
synthetic and real-world data. Synthetic data is represented by edge-augmented
trees and shortest-distance metrics while the real-world datasets include Zoo,
Iris, Glass, Segmentation and SpamBase; on these datasets, the average
improvement with respect to NJ is $125.94\%$.
Related papers
- Fast unsupervised ground metric learning with tree-Wasserstein distance [14.235762519615175]
unsupervised ground metric learning approaches have been introduced.
We propose to augment the WSV method by embedding samples and features on trees, on which we compute the tree-Wasserstein distance (TWD)
We demonstrate theoretically and empirically that the algorithm converges to a better approximation of the full WSV approach than the best known alternatives, and does so with $mathcalO(n3)$ complexity.
arXiv Detail & Related papers (2024-11-11T23:21:01Z) - Fitting trees to $\ell_1$-hyperbolic distances [4.220336689294244]
Building trees to represent or to fit distances is a critical component of phylogenetic analysis.
We study the tree fitting problem as finding the relation between the hyperbolicity (ultrametricity) vector and the error of tree (ultrametric) embedding.
arXiv Detail & Related papers (2024-09-02T07:38:32Z) - Revisiting Evaluation Metrics for Semantic Segmentation: Optimization
and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
arXiv Detail & Related papers (2023-10-30T03:45:15Z) - Tight and fast generalization error bound of graph embedding in metric
space [54.279425319381374]
We show that graph embedding in non-Euclidean metric space can outperform that in Euclidean space with much smaller training data than the existing bound has suggested.
Our new upper bound is significantly tighter and faster than the existing one, which can be exponential to $R$ and $O(frac1S)$ at the fastest.
arXiv Detail & Related papers (2023-05-13T17:29:18Z) - Optimal Sparse Regression Trees [24.03491277969824]
This work proposes a dynamic-programming-with-bounds approach to the construction of provably-optimal sparse regression trees.
We leverage a novel lower bound based on an optimal solution to the k-Means clustering algorithm in 1-dimension over the set of labels.
arXiv Detail & Related papers (2022-11-28T00:43:21Z) - Active-LATHE: An Active Learning Algorithm for Boosting the Error
Exponent for Learning Homogeneous Ising Trees [75.93186954061943]
We design and analyze an algorithm that boosts the error exponent by at least 40% when $rho$ is at least $0.8$.
Our analysis hinges on judiciously exploiting the minute but detectable statistical variation of the samples to allocate more data to parts of the graph.
arXiv Detail & Related papers (2021-10-27T10:45:21Z) - Decentralized Learning of Tree-Structured Gaussian Graphical Models from
Noisy Data [1.52292571922932]
This paper studies the decentralized learning of tree-structured Gaussian graphical models (GGMs) from noisy data.
GGMs are widely used to model complex networks such as gene regulatory networks and social networks.
The proposed decentralized learning uses the Chow-Liu algorithm for estimating the tree-structured GGM.
arXiv Detail & Related papers (2021-09-22T10:41:18Z) - Highly Scalable and Provably Accurate Classification in Poincare Balls [40.82908295137667]
We establish a unified framework for learning scalable and simple hyperbolic linear classifiers with provable performance guarantees.
Our results include a new hyperbolic and second-order perceptron algorithm as well as an efficient and highly accurate convex optimization setup for hyperbolic support vector machine classifiers.
Their performance accuracies on synthetic data sets comprising millions of points, as well as on complex real-world data sets such as single-cell RNA-seq expression measurements, CIFAR10, Fashion-MNIST and mini-ImageNet.
arXiv Detail & Related papers (2021-09-08T16:59:39Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel emphtrellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - Forest R-CNN: Large-Vocabulary Long-Tailed Object Detection and Instance
Segmentation [75.93960390191262]
We exploit prior knowledge of the relations among object categories to cluster fine-grained classes into coarser parent classes.
We propose a simple yet effective resampling method, NMS Resampling, to re-balance the data distribution.
Our method, termed as Forest R-CNN, can serve as a plug-and-play module being applied to most object recognition models.
arXiv Detail & Related papers (2020-08-13T03:52:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.