HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis
- URL: http://arxiv.org/abs/2509.02113v1
- Date: Tue, 02 Sep 2025 09:10:52 GMT
- Title: HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis
- Authors: Han Chen, Hanchen Wang, Hongmei Chen, Ying Zhang, Lu Qin, Wenjie Zhang
- Abstract summary: We introduce HiGraph, the largest public hierarchical graph dataset for malware analysis, comprising over 200M Control Flow Graphs (CFGs) nested within 595K Function Call Graphs (FCGs). This two-level representation preserves structural semantics essential for building robust detectors resilient to code obfuscation and malware evolution. We demonstrate HiGraph's utility through a large-scale analysis that reveals distinct structural properties of benign and malicious software, establishing it as a foundational benchmark for the community.
- Score: 28.52072763032641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advancement of graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hierarchical structure of software. Existing methods often oversimplify programs into single-level graphs, failing to model the crucial semantic relationship between high-level functional interactions and low-level instruction logic. To bridge this gap, we introduce HiGraph, the largest public hierarchical graph dataset for malware analysis, comprising over 200M Control Flow Graphs (CFGs) nested within 595K Function Call Graphs (FCGs). This two-level representation preserves structural semantics essential for building robust detectors resilient to code obfuscation and malware evolution. We demonstrate HiGraph's utility through a large-scale analysis that reveals distinct structural properties of benign and malicious software, establishing it as a foundational benchmark for the community. The dataset and tools are publicly available at https://higraph.org.
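The two-level representation described in the abstract, with one CFG nested inside each node of an FCG, can be pictured with a small sketch. This is only an illustration under stated assumptions: the class names `ControlFlowGraph` and `HierarchicalGraph`, the helper `build_toy_program`, and the toy function names are all hypothetical and are not HiGraph's actual file format or tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ControlFlowGraph:
    """Low level: the basic blocks of one function and the jumps between them."""
    blocks: list[str]
    edges: list[tuple[int, int]] = field(default_factory=list)


@dataclass
class HierarchicalGraph:
    """High level: a function call graph whose nodes each own a CFG."""
    cfgs: dict[str, ControlFlowGraph] = field(default_factory=dict)
    calls: set[tuple[str, str]] = field(default_factory=set)

    def add_function(self, name: str, cfg: ControlFlowGraph) -> None:
        # Each FCG node is a function; its payload is that function's CFG.
        self.cfgs[name] = cfg

    def add_call(self, caller: str, callee: str) -> None:
        # An FCG edge is only valid between functions that already exist.
        if caller not in self.cfgs or callee not in self.cfgs:
            raise KeyError("both functions must be added before linking them")
        self.calls.add((caller, callee))

    def summary(self) -> dict[str, int]:
        # Aggregate counts across both levels of the hierarchy.
        return {
            "functions": len(self.cfgs),
            "call_edges": len(self.calls),
            "basic_blocks": sum(len(c.blocks) for c in self.cfgs.values()),
            "cfg_edges": sum(len(c.edges) for c in self.cfgs.values()),
        }


def build_toy_program() -> HierarchicalGraph:
    # Hypothetical two-function program: main loops and calls decrypt.
    g = HierarchicalGraph()
    g.add_function(
        "main",
        ControlFlowGraph(["entry", "loop", "exit"], [(0, 1), (1, 1), (1, 2)]),
    )
    g.add_function("decrypt", ControlFlowGraph(["entry", "exit"], [(0, 1)]))
    g.add_call("main", "decrypt")
    return g
```

Keeping the call edges separate from the per-function CFGs means a model can consume either level alone or both together, which is the kind of flexibility a hierarchical dataset is meant to allow.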
Related papers
- Better Call Graphs: A New Dataset of Function Call Graphs for Malware Classification [1.201622168415522]
We introduce Better Call Graphs (BCG), a comprehensive dataset of large and unique Function Call Graphs (FCGs) extracted from recent Android application packages (APKs). BCG includes both benign and malicious samples spanning various families and types, along with graph-level features for each APK.
arXiv Detail & Related papers (2025-12-24T01:21:38Z)
- Dynamic Deep Graph Learning for Incomplete Multi-View Clustering with Masked Graph Reconstruction Loss [26.31060859315329]
We propose a novel Dynamic Deep Graph Learning for Incomplete Multi-View Clustering with Masked Graph Reconstruction Loss (DGIMVCM). A graph convolutional embedding layer is then designed to extract primary features and refined dynamic view-specific graph structures, leveraging the global graph for imputation of missing views.
arXiv Detail & Related papers (2025-11-14T11:26:38Z)
- G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge [88.82814893945077]
Large language models (LLMs) excel at complex reasoning but remain limited by static and incomplete parametric knowledge. Recent graph-enhanced RAG (GraphRAG) attempts to bridge this gap by constructing tailored graphs and enabling LLMs to reason on them. G-reasoner is a unified framework that integrates graph and language foundation models for reasoning over diverse graph-structured knowledge.
arXiv Detail & Related papers (2025-09-29T04:38:12Z)
- From Features to Structure: Task-Aware Graph Construction for Relational and Tabular Learning with GNNs [6.0757501646966965]
We introduce auGraph, a unified framework for task-aware graph augmentation. auGraph enhances base graph structures by selectively promoting attributes into nodes. It preserves the original data schema while injecting task-relevant structural signal.
arXiv Detail & Related papers (2025-06-02T20:42:53Z)
- Beyond Message Passing: Neural Graph Pattern Machine [50.78679002846741]
We introduce the Neural Graph Pattern Machine (GPM), a novel framework that bypasses message passing by learning directly from graph substructures. GPM efficiently extracts, encodes, and prioritizes task-relevant graph patterns, offering greater expressivity and improved ability to capture long-range dependencies.
arXiv Detail & Related papers (2025-01-30T20:37:47Z)
- AutoG: Towards automatic graph construction from tabular data [60.877867570524884]
We aim to formalize the graph construction problem and propose an effective solution. Existing automatic construction methods can only be applied to some specific cases. We present a set of datasets to formalize and evaluate graph construction methods. We also propose an LLM-based solution, AutoG, which automatically generates high-quality graph schemas.
arXiv Detail & Related papers (2025-01-25T17:31:56Z)
- Revisiting Graph Neural Networks on Graph-level Tasks: Comprehensive Experiments, Analysis, and Improvements [54.006506479865344]
We propose a unified evaluation framework for graph-level Graph Neural Networks (GNNs). This framework provides a standardized setting to evaluate GNNs across diverse datasets. We also propose a novel GNN model with enhanced expressivity and generalization capabilities.
arXiv Detail & Related papers (2025-01-01T08:48:53Z)
- GraphCroc: Cross-Correlation Autoencoder for Graph Structural Reconstruction [6.817416560637197]
Graph autoencoders (GAEs) reconstruct graph structures from node embeddings.
We introduce a cross-correlation mechanism that significantly enhances the GAE representational capabilities.
We also propose GraphCroc, a new GAE that supports flexible encoder architectures tailored for various downstream tasks.
arXiv Detail & Related papers (2024-10-04T12:59:45Z)
- Learning to Model Graph Structural Information on MLPs via Graph Structure Self-Contrasting [50.181824673039436]
We propose a Graph Structure Self-Contrasting (GSSC) framework that learns graph structural information without message passing.
The proposed framework is based purely on Multi-Layer Perceptrons (MLPs), where the structural information is only implicitly incorporated as prior knowledge.
It first applies structural sparsification to remove potentially uninformative or noisy edges in the neighborhood, and then performs structural self-contrasting in the sparsified neighborhood to learn robust node representations.
arXiv Detail & Related papers (2024-09-09T12:56:02Z)
- GraphEdit: Large Language Models for Graph Structure Learning [14.16155596597421]
Graph Structure Learning (GSL) focuses on capturing intrinsic dependencies and interactions among nodes in graph-structured data. Existing GSL methods heavily depend on explicit graph structural information as supervision signals. We propose GraphEdit, an approach that leverages large language models (LLMs) to learn complex node relationships in graph-structured data.
arXiv Detail & Related papers (2024-02-23T08:29:42Z)
- HUGE: Huge Unsupervised Graph Embeddings with TPUs [6.108914274067702]
Graph Embedding is a process of creating a continuous representation of nodes in a graph.
A high-performance graph embedding architecture that leverages TPUs with configurable amounts of high-bandwidth memory is presented.
We verify the embedding space quality on real and synthetic large-scale datasets.
arXiv Detail & Related papers (2023-07-26T20:29:15Z)
- GraphMI: Extracting Private Graph Data from Graph Neural Networks [59.05178231559796]
We present the Graph Model Inversion attack (GraphMI), which aims to extract private data of the training graph by inverting a GNN.
Specifically, we propose a projected gradient module to tackle the discreteness of graph edges while preserving the sparsity and smoothness of graph features.
We design a graph auto-encoder module to efficiently exploit graph topology, node attributes, and target model parameters for edge inference.
arXiv Detail & Related papers (2021-06-05T07:07:52Z)
- Learning Graph Structure With A Finite-State Automaton Layer [31.028101360041227]
We study the problem of learning to derive abstract relations from the intrinsic graph structure.
We show how to learn these relations end-to-end by relaxing the problem into learning finite-state automata policies.
We demonstrate that this layer can find shortcuts in grid-world graphs and reproduce simple static analyses on Python programs.
arXiv Detail & Related papers (2020-07-09T17:01:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.