Hierarchical Locality Sensitive Hashing for Structured Data: A Survey
- URL: http://arxiv.org/abs/2204.11209v4
- Date: Tue, 11 Mar 2025 03:23:22 GMT
- Title: Hierarchical Locality Sensitive Hashing for Structured Data: A Survey
- Authors: Wei Wu, Bin Li
- Abstract summary: The Locality Sensitive Hashing (LSH) technique has been proposed to provide accurate estimators for various similarity measures between sets or vectors. In this paper, we explore the present progress of the research into hierarchical LSH algorithms.
- Score: 8.045541999149002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data similarity (or distance) computation is a fundamental research topic which fosters a variety of similarity-based machine learning and data mining applications. In big data analytics, it is impractical to compute the exact similarity of data instances due to the high computational cost. To this end, the Locality Sensitive Hashing (LSH) technique has been proposed to provide accurate estimators for various similarity measures between sets or vectors in an efficient manner, without a learning process. Structured data (e.g., sequences, trees and graphs), which are composed of elements and relations between the elements, are commonly seen in the real world, but traditional LSH algorithms cannot preserve the structure information represented as relations between elements. To overcome this issue, researchers have devoted themselves to the family of hierarchical LSH algorithms. In this paper, we explore the present progress of research into hierarchical LSH from the following perspectives: 1) Data structures, where we review various hierarchical LSH algorithms for three typical data structures and uncover their inherent connections; 2) Applications, where we review hierarchical LSH algorithms in multiple application scenarios; 3) Challenges, where we discuss some potential challenges as future directions.
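For concreteness, the following is a minimal sketch of MinHash, a classical LSH scheme in which the fraction of agreeing signature slots is an unbiased estimator of the Jaccard similarity between two sets; the token sets, hash count, and prime below are illustrative choices, not drawn from the survey.

```python
import random

def minhash_signature(s, num_hashes=128, prime=(1 << 61) - 1, seed=0):
    """Signature of a set of integer tokens under random affine hashes.

    Each hash is h(x) = (a*x + b) mod prime; the signature keeps the
    minimum hash value per function.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return [min((a * x + b) % prime for x in s) for a, b in params]

def estimate_jaccard(sig1, sig2):
    """Fraction of matching slots estimates Jaccard similarity."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)

a, b = {1, 2, 3, 4, 5, 6}, {4, 5, 6, 7, 8, 9}
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
# close to |a & b| / |a | b| = 3/9 ≈ 0.33
```

The hierarchical LSH algorithms surveyed in the paper extend this flat setting to structured data by applying such sketches over the elements and the relations between them.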
Related papers
- Score-matching-based Structure Learning for Temporal Data on Networks [17.166362605356074]
Causal discovery is a crucial initial step in establishing causality from empirical data and background knowledge.
Current score-matching-based algorithms are primarily designed to analyze independent and identically distributed (i.i.d.) data.
We have developed a new parent-finding subroutine for leaf nodes in DAGs, significantly accelerating the most time-consuming part of the process: the pruning step.
arXiv Detail & Related papers (2024-12-10T12:36:35Z) - The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds [0.09208007322096533]
The investigation focuses on HNSW's efficacy across a spectrum of datasets.
We discover that the recall of approximate HNSW search, in comparison to exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality.
We observe that running popular benchmark datasets with HNSW instead of KNN can shift rankings by up to three positions for some models.
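To make the recall comparison concrete, here is a minimal sketch that measures HNSW recall against exact brute-force KNN on random vectors; it assumes the third-party hnswlib package, and all parameter values are illustrative rather than those used in the paper.

```python
import numpy as np
import hnswlib  # assumed installed: pip install hnswlib

dim, n, k = 64, 5000, 10
rng = np.random.default_rng(0)
data = rng.standard_normal((n, dim)).astype(np.float32)
queries = rng.standard_normal((100, dim)).astype(np.float32)

# Exact KNN by brute force on squared Euclidean distances.
d2 = (queries ** 2).sum(1)[:, None] + (data ** 2).sum(1)[None, :] \
    - 2 * queries @ data.T
exact = np.argsort(d2, axis=1)[:, :k]

# Approximate search with an HNSW index.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data)
index.set_ef(50)  # search-time speed/accuracy trade-off
approx, _ = index.knn_query(queries, k=k)

# Recall@k: fraction of the true k neighbours that HNSW returns.
recall = np.mean([len(set(a) & set(e)) / k for a, e in zip(approx, exact)])
print(f"recall@{k} = {recall:.3f}")
```

Raising ef trades search speed for recall, which makes the intrinsic-dimensionality effect reported above easy to probe on one's own data.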
arXiv Detail & Related papers (2024-05-28T04:16:43Z) - Relation-aware Ensemble Learning for Knowledge Graph Embedding [68.94900786314666]
We propose to learn an ensemble by leveraging existing methods in a relation-aware manner.
Exploring these semantics with a relation-aware ensemble leads to a much larger search space than general ensemble methods.
We propose a divide-search-combine algorithm, RelEns-DSC, that searches the relation-wise ensemble weights independently.
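A toy sketch of the divide-search-combine idea, assuming NumPy: because the weights for one relation do not influence another, each relation's ensemble weights can be searched independently. The data, the MRR setup, and the random search (standing in for the paper's per-relation optimizer) are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_relations, n_queries, n_cands = 3, 4, 50, 10

# Toy validation scores: scores[m, r, q, c] is base model m's score for
# candidate c of query q with relation r.
scores = rng.random((n_models, n_relations, n_queries, n_cands))
truth = rng.integers(0, n_cands, (n_relations, n_queries))

def mrr(combined, true_idx):
    """Mean reciprocal rank of the true candidate under combined scores."""
    true_scores = combined[np.arange(len(true_idx)), true_idx]
    ranks = (combined > true_scores[:, None]).sum(-1) + 1
    return (1.0 / ranks).mean()

# Divide by relation, search weights per relation, combine the results.
best_weights = np.zeros((n_relations, n_models))
for r in range(n_relations):
    best = -1.0
    for _ in range(200):
        w = rng.dirichlet(np.ones(n_models))
        score = mrr(np.tensordot(w, scores[:, r], axes=1), truth[r])
        if score > best:
            best, best_weights[r] = score, w
print(best_weights)  # one weight vector per relation
```

Searching each relation separately keeps the cost proportional to the number of relations instead of searching all relations' weights jointly.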
arXiv Detail & Related papers (2023-10-13T07:40:12Z) - How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model [47.617093812158366]
We introduce the Random Hierarchy Model: a family of synthetic tasks inspired by the hierarchical structure of language and images.
We find that deep networks learn the task by developing internal representations invariant to exchanging equivalent groups.
Our results indicate how deep networks overcome the curse of dimensionality by building invariant representations.
arXiv Detail & Related papers (2023-07-05T09:11:09Z) - ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate Nearest Neighbor Search Algorithms [5.478671305092084]
We introduce ParlayANN, a library of deterministic and parallel graph-based approximate nearest neighbor search algorithms.
We develop novel parallel implementations for four state-of-the-art graph-based ANNS algorithms that scale to billion-scale datasets.
arXiv Detail & Related papers (2023-05-07T19:28:23Z) - A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z) - Community detection in complex networks via node similarity, graph representation learning, and hierarchical clustering [4.264842058017711]
Community detection is a critical challenge in analysing real graphs.
This article proposes three new, general, hierarchical frameworks to deal with this task.
We compare over a hundred module combinations on Block Model graphs and real-life datasets.
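One minimal instance of such a framework, assuming networkx and scipy: score node pairs with Jaccard neighbourhood similarity, then cut an agglomerative-clustering dendrogram; the toy graph and the choice of two communities are illustrative.

```python
import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy graph with two planted communities of 10 nodes each.
G = nx.planted_partition_graph(2, 10, 0.8, 0.05, seed=0)
nodes = list(G.nodes)

# Node similarity: Jaccard overlap of closed neighbourhoods.
neigh = {v: set(G[v]) | {v} for v in nodes}
sim = np.array([[len(neigh[u] & neigh[v]) / len(neigh[u] | neigh[v])
                 for v in nodes] for u in nodes])

# Hierarchical clustering on distance = 1 - similarity.
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 communities
print(labels)
```

Swapping in a learned node embedding for the Jaccard similarity, or a different linkage, yields other members of the family of frameworks the paper compares.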
arXiv Detail & Related papers (2023-03-21T22:12:53Z) - Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z) - Towards Similarity-Aware Time-Series Classification [51.2400839966489]
We study time-series classification (TSC), a fundamental task of time-series data mining.
We propose Similarity-Aware Time-Series Classification (SimTSC), a framework that models similarity information with graph neural networks (GNNs).
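SimTSC consumes a similarity graph over time series; a minimal sketch of building such a graph with dynamic time warping (DTW) distances and k-nearest-neighbour edges follows, with toy series and an illustrative k (the GNN itself is omitted).

```python
import numpy as np

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy dataset: three noisy sines and three noisy ramps.
rng = np.random.default_rng(0)
series = [np.sin(np.linspace(0, 6, 50)) + 0.1 * rng.standard_normal(50)
          for _ in range(3)]
series += [np.linspace(0, 1, 50) + 0.1 * rng.standard_normal(50)
           for _ in range(3)]

# Pairwise DTW distances -> kNN similarity graph (k = 2).
N, k = len(series), 2
dist = np.array([[dtw(s, t) for t in series] for s in series])
edges = {(i, int(j)) for i in range(N) for j in np.argsort(dist[i])[1:k + 1]}
print(sorted(edges))  # adjacency a GNN could then propagate labels over
```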
arXiv Detail & Related papers (2022-01-05T02:14:57Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
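For intuition about the quantity being estimated, here is a small sketch, assuming NumPy: exact leverage scores read off a thin SVD, next to a randomized estimate from a Gaussian sketch. The matrix sizes and sketch dimension are illustrative, and the scheme is a generic sketch-and-solve baseline rather than the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
A = rng.standard_normal((n, d))

# Exact: leverage scores are squared row norms of U in a thin SVD A = U S V^T.
U, _, _ = np.linalg.svd(A, full_matrices=False)
exact = (U ** 2).sum(axis=1)

# Randomized: compress A with an m x n Gaussian sketch S, take R from a QR
# factorization of S @ A, then score the rows of A @ inv(R).
m = 200  # sketch size, illustrative
S = rng.standard_normal((m, n)) / np.sqrt(m)
_, R = np.linalg.qr(S @ A)
approx = (np.linalg.solve(R.T, A.T).T ** 2).sum(axis=1)

print(np.abs(exact - approx).max())  # small when the sketch preserves geometry
```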
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark [57.10850350508929]
We aim to provide a unified framework to summarize and evaluate existing research on heterogeneous network embedding (HNE).
As the first contribution, we provide a generic paradigm for the systematic categorization and analysis of the merits of various existing HNE algorithms.
As the second contribution, we create four benchmark datasets with various properties regarding scale, structure, attribute/label availability, etc., from different sources.
As the third contribution, we create friendly interfaces for 13 popular HNE algorithms, and provide all-around comparisons among them over multiple tasks and experimental settings.
arXiv Detail & Related papers (2020-04-01T03:42:11Z) - Data Structures & Algorithms for Exact Inference in Hierarchical Clustering [41.24805506595378]
We present dynamic-programming algorithms for exact inference in hierarchical clustering based on a novel trellis data structure.
Our algorithms scale in time and space proportional to the size of the powerset of $N$ elements, which is super-exponentially more efficient than explicitly considering each of the $(2N-3)!!$ possible hierarchies.
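A minimal sketch of the underlying idea, not the paper's trellis implementation: dynamic programming over subsets finds a minimum-cost binary hierarchy by reusing optimal sub-hierarchies instead of enumerating all $(2N-3)!!$ trees; the one-dimensional data and the spread-based merge cost are illustrative placeholders.

```python
from functools import lru_cache

points = [0.0, 0.1, 0.9, 1.0, 5.0]  # toy one-dimensional data
N = len(points)

def merge_cost(mask):
    """Illustrative cluster cost: spread of the points in the subset."""
    vals = [points[i] for i in range(N) if mask >> i & 1]
    return max(vals) - min(vals)

@lru_cache(maxsize=None)
def best(mask):
    """Minimum total cost of any binary hierarchy over subset `mask`."""
    if mask & (mask - 1) == 0:  # singleton: nothing left to merge
        return 0.0
    best_split, sub = float("inf"), (mask - 1) & mask
    while sub:  # enumerate every split of `mask` into two non-empty parts
        other = mask ^ sub
        if sub < other:  # visit each unordered split exactly once
            best_split = min(best_split, best(sub) + best(other))
        sub = (sub - 1) & mask
    return merge_cost(mask) + best_split

print(best((1 << N) - 1))  # cost of an optimal hierarchy over all points
```

The cache ranges over all 2^N subsets, matching the powerset scaling stated above, while each of the $(2N-3)!! = 105$ hierarchies over five points is only implicitly considered.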
arXiv Detail & Related papers (2020-02-26T17:43:53Z)