Distributed Graph Embedding with Information-Oriented Random Walks
- URL: http://arxiv.org/abs/2303.15702v2
- Date: Sun, 25 Feb 2024 08:12:26 GMT
- Title: Distributed Graph Embedding with Information-Oriented Random Walks
- Authors: Peng Fang, Arijit Khan, Siqiang Luo, Fang Wang, Dan Feng, Zhenli Li,
Wei Yin, Yuchao Cao
- Abstract summary: Graph embedding maps graph nodes to low-dimensional vectors, and is widely adopted in machine learning tasks.
We present a general-purpose, distributed, information-centric random walk-based graph embedding framework, DistGER, which can scale to embed billion-edge graphs.
DistGER exhibits 2.33x-129x acceleration, 45% reduction in cross-machine communication, and > 10% effectiveness improvement in downstream tasks.
- Score: 16.290803469068145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graph embedding maps graph nodes to low-dimensional vectors, and is widely
adopted in machine learning tasks. The increasing availability of billion-edge
graphs underscores the importance of learning efficient and effective
embeddings on large graphs, such as link prediction on Twitter with over one
billion edges. Most existing graph embedding methods fall short of reaching
high data scalability. In this paper, we present a general-purpose,
distributed, information-centric random walk-based graph embedding framework,
DistGER, which can scale to embed billion-edge graphs. DistGER incrementally
computes information-centric random walks. It further leverages a
multi-proximity-aware, streaming, parallel graph partitioning strategy,
simultaneously achieving high local partition quality and excellent workload
balancing across machines. DistGER also improves the distributed Skip-Gram
learning model to generate node embeddings by optimizing the access locality,
CPU throughput, and synchronization efficiency. Experiments on real-world
graphs demonstrate that compared to state-of-the-art distributed graph
embedding frameworks, including KnightKing, DistDGL, and Pytorch-BigGraph,
DistGER exhibits 2.33x-129x acceleration, 45% reduction in cross-machine
communication, and > 10% effectiveness improvement in downstream tasks.
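To make the pipeline concrete, here is a minimal single-machine sketch of the generic random-walk-plus-Skip-Gram scheme that DistGER distributes. The uniform walks, the walk length, and gensim's off-the-shelf Skip-Gram are stand-in assumptions; DistGER's information-centric walk termination, multi-proximity-aware partitioner, and optimized distributed Skip-Gram are not reproduced here.

```python
# Minimal sketch: uniform random walks + Skip-Gram (DeepWalk-style).
# Assumptions: toy edge list, DeepWalk-like hyperparameters, and gensim's
# Word2Vec as the Skip-Gram learner; none of this is DistGER's actual code.
import random
from collections import defaultdict

from gensim.models import Word2Vec  # standard Skip-Gram implementation


def build_adjacency(edges):
    """Adjacency list for an undirected graph from (u, v) pairs."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def random_walks(adj, walks_per_node=10, walk_length=80, seed=42):
    """Fixed-length uniform walks; DistGER instead stops each walk once it
    has gathered enough information (its information-centric criterion)."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(v) for v in walk])  # gensim expects tokens
    return walks


edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
walks = random_walks(build_adjacency(edges))
model = Word2Vec(walks, vector_size=64, window=5, min_count=0,
                 sg=1, workers=4, epochs=5)  # sg=1 selects Skip-Gram
print(model.wv["0"][:5])  # first few embedding dimensions of node 0
```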
Related papers
- GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs [6.418397511692011]
We propose a unified framework for both supervised and unsupervised learning that stores and processes large graph data in a distributed fashion.
The key insight in our design is the separation of workers who store data and those who perform the training.
Our experiments show that GraphScale outperforms state-of-the-art methods for distributed training of both GNNs and node embeddings.
arXiv Detail & Related papers (2024-07-22T08:09:36Z)
- Graph Transformers for Large Graphs [57.19338459218758]
This work advances representation learning on single large-scale graphs with a focus on identifying model characteristics and critical design constraints.
A key innovation of this work lies in the creation of a fast neighborhood sampling technique coupled with a local attention mechanism.
We report a 3x speedup and a 16.8% performance gain on ogbn-products and snap-patents, and also scale LargeGT on ogbn-100M with a 5.9% performance improvement.
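The "sample neighbors, then attend locally" recipe can be illustrated with a toy single-head sketch; the uniform sampling policy, the scaled dot-product scoring, and the function name below are illustrative assumptions, not LargeGT's actual design.

```python
# Hypothetical sketch of local attention over a sampled neighborhood.
import torch
import torch.nn.functional as F


def sampled_local_attention(h, adj, node, num_samples=8):
    """Update `node` by attending over a random sample of its neighbors.

    h: (N, d) node feature matrix; adj: dict node -> list of neighbor ids.
    """
    neigh = adj[node]
    if len(neigh) > num_samples:  # fast approximation: subsample neighbors
        idx = torch.randperm(len(neigh))[:num_samples]
        neigh = [neigh[int(i)] for i in idx]
    keys = h[torch.tensor(neigh + [node])]      # (k, d), include self
    scores = keys @ h[node] / h.size(1) ** 0.5  # scaled dot-product
    attn = F.softmax(scores, dim=0)
    return attn @ keys                          # (d,) new representation


h = torch.randn(6, 16)
adj = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
print(sampled_local_attention(h, adj, node=0, num_samples=3).shape)
```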
arXiv Detail & Related papers (2023-12-18T11:19:23Z)
- HUGE: Huge Unsupervised Graph Embeddings with TPUs [6.108914274067702]
Graph Embedding is a process of creating a continuous representation of nodes in a graph.
A high-performance graph embedding architecture leveraging large amounts of high-bandwidth memory is presented.
We verify the embedding space quality on real and synthetic large-scale datasets.
arXiv Detail & Related papers (2023-07-26T20:29:15Z)
- DOTIN: Dropping Task-Irrelevant Nodes for GNNs [119.17997089267124]
Recent graph learning approaches have introduced the pooling strategy to reduce the size of graphs for learning.
We design a new approach called DOTIN (Dropping Task-Irrelevant Nodes) to reduce the size of graphs.
Our method speeds up GAT by about 50% on graph-level tasks including graph classification and graph edit distance.
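A hypothetical sketch of the score-and-drop idea behind the method: keep the top-scoring nodes and reindex the surviving edges. DOTIN learns task-relevance scores end-to-end; in this sketch the scores are simply an input vector.

```python
# Hypothetical sketch of score-based node dropping (not DOTIN's code).
import torch


def drop_low_score_nodes(x, edge_index, scores, keep_ratio=0.5):
    """Keep the top keep_ratio nodes by score and reindex the edges.

    x: (N, d) features; edge_index: (2, E) long tensor; scores: (N,).
    """
    num_nodes = x.size(0)
    k = max(1, int(keep_ratio * num_nodes))
    keep = torch.topk(scores, k).indices
    remap = torch.full((num_nodes,), -1, dtype=torch.long)
    remap[keep] = torch.arange(k)                 # old id -> compact new id
    src, dst = remap[edge_index[0]], remap[edge_index[1]]
    mask = (src >= 0) & (dst >= 0)                # drop edges of removed nodes
    return x[keep], torch.stack([src[mask], dst[mask]])


x = torch.randn(5, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
scores = torch.tensor([0.9, 0.1, 0.8, 0.2, 0.7])
x_small, ei_small = drop_low_score_nodes(x, edge_index, scores)
print(x_small.shape, ei_small)
```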
arXiv Detail & Related papers (2022-04-28T12:00:39Z)
- Scaling R-GCN Training with Graph Summarization [71.06855946732296]
Training of Relation Graph Convolutional Networks (R-GCN) does not scale well with the size of the graph.
In this work, we experiment with the use of graph summarization techniques to compress the graph.
We obtain reasonable results on the AIFB, MUTAG and AM datasets.
arXiv Detail & Related papers (2022-03-05T00:28:43Z)
- GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy [5.466414428765544]
We present GraphTheta, a new distributed graph learning system.
It supports multiple training strategies and enables efficient and scalable learning on big graphs.
This work represents the largest edge-attributed GNN learning task conducted on a billion-scale network in the literature.
arXiv Detail & Related papers (2021-04-21T14:51:33Z)
- Distributed Training of Graph Convolutional Networks using Subgraph Approximation [72.89940126490715]
We propose a training strategy that mitigates the lost information across multiple partitions of a graph through a subgraph approximation scheme.
The subgraph approximation approach helps the distributed training system converge at single-machine accuracy.
arXiv Detail & Related papers (2020-12-09T09:23:49Z)
- Scaling Graph Neural Networks with Approximate PageRank [64.92311737049054]
We present the PPRGo model which utilizes an efficient approximation of information diffusion in GNNs.
In addition to being faster, PPRGo is inherently scalable, and can be trivially parallelized for large datasets like those found in industry settings.
We show that training PPRGo and predicting labels for all nodes in this graph takes under 2 minutes on a single machine, far outpacing other baselines on the same graph.
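PPRGo's speed comes from replacing iterative message passing with sparse personalized-PageRank vectors. As a sketch of that ingredient, here is the classic residual-push routine for approximate personalized PageRank; this is a textbook variant, not necessarily PPRGo's exact implementation.

```python
# Sketch of push-based approximate personalized PageRank (local push).
from collections import defaultdict


def approximate_ppr(adj, source, alpha=0.15, eps=1e-4):
    """Sparse PPR vector; stops once every residual r[v] < eps * deg(v).

    adj: dict node -> list of neighbors, covering every node.
    """
    p = defaultdict(float)   # accumulated PPR mass
    r = defaultdict(float)   # residual mass still to be pushed
    r[source] = 1.0
    queue = [source]
    while queue:
        u = queue.pop()
        deg = len(adj[u])
        if deg == 0 or r[u] < eps * deg:
            continue                      # nothing significant to push
        push, r[u] = r[u], 0.0
        p[u] += alpha * push              # retain a share at u ...
        share = (1 - alpha) * push / deg  # ... spread the rest to neighbors
        for v in adj[u]:
            r[v] += share
            if r[v] >= eps * len(adj[v]):
                queue.append(v)
    return dict(p)


adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(approximate_ppr(adj, source=0))
```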
arXiv Detail & Related papers (2020-07-03T09:30:07Z)
- Wasserstein Embedding for Graph Learning [33.90471037116372]
Wasserstein Embedding for Graph Learning (WEGL) is a framework for embedding entire graphs in a vector space.
We leverage new insights on defining similarity between graphs as a function of the similarity between their node embedding distributions.
We evaluate our new graph embedding approach on various benchmark graph-property prediction tasks.
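The core move, comparing graphs through the distributions of their node embeddings, can be sketched with a sliced Wasserstein distance between two embedding clouds. Note that WEGL itself uses a linear optimal-transport embedding into a fixed vector space; the sliced approximation below only illustrates the distribution-comparison idea.

```python
# Sketch: sliced Wasserstein-2 distance between node-embedding clouds.
import numpy as np


def sliced_wasserstein(X, Y, n_projections=50, n_quantiles=100, seed=0):
    """Approximate W2 between point clouds X (n, d) and Y (m, d)."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, n_quantiles)
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)     # random direction on the sphere
        xq = np.quantile(X @ theta, grid)  # 1-D quantile functions ...
        yq = np.quantile(Y @ theta, grid)  # ... make W2 closed-form
        total += np.mean((xq - yq) ** 2)
    return np.sqrt(total / n_projections)


X = np.random.default_rng(1).normal(size=(30, 16))        # graph A's embeddings
Y = np.random.default_rng(2).normal(1.0, 1.0, (40, 16))   # graph B's, shifted
print(sliced_wasserstein(X, Y))
```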
arXiv Detail & Related papers (2020-06-16T18:23:00Z)
- Block-Approximated Exponential Random Graphs [77.4792558024487]
An important challenge in the field of exponential random graphs (ERGs) is fitting non-trivial ERGs on large graphs.
We propose an approximative framework for such non-trivial ERGs that results in dyadic independence (i.e., edge-independent) distributions.
Our methods are scalable to sparse graphs consisting of millions of nodes.
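Dyadic independence means the fitted model factorizes over node pairs, so sampling a graph reduces to one independent Bernoulli draw per edge. A sketch with a made-up block-structured probability matrix standing in for the paper's approximation:

```python
# Sketch: sampling from an edge-independent (dyadic independence) model.
# The two-block probability matrix below is an illustrative assumption.
import numpy as np


def sample_edge_independent(P, seed=0):
    """Sample an undirected simple graph with A_ij ~ Bernoulli(P_ij)."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)  # independent coin flips
    return upper | upper.T                        # symmetric, no self-loops


blocks = np.array([0] * 3 + [1] * 3)              # two communities of 3 nodes
P = np.where(blocks[:, None] == blocks[None, :], 0.8, 0.1)
print(sample_edge_independent(P).astype(int))
```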
arXiv Detail & Related papers (2020-02-14T11:42:16Z)