Bridging the Semantic Gap in Virtual Machine Introspection and Forensic Memory Analysis
- URL: http://arxiv.org/abs/2503.05482v1
- Date: Fri, 07 Mar 2025 14:51:32 GMT
- Title: Bridging the Semantic Gap in Virtual Machine Introspection and Forensic Memory Analysis
- Authors: Christofer Fellicious, Hans P. Reiser, Michael Granitzer
- Abstract summary: "Semantic Gap" is the difficulty of interpreting raw memory data without specialized tools and expertise. We investigate how a priori knowledge, metadata and engineered features can aid VMI and FMA. Our methods show that having more metadata boosts performance, with all methods obtaining an F1-Score of over 80%.
- Score: 0.6372911857214884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Forensic Memory Analysis (FMA) and Virtual Machine Introspection (VMI) are critical tools for security in virtualization-based approaches. VMI and FMA involve using digital forensic methods to extract information from the system to identify and explain security incidents. A key challenge in both FMA and VMI is the "Semantic Gap", the difficulty of interpreting raw memory data without specialized tools and expertise. In this work, we investigate how a priori knowledge, metadata and engineered features can aid VMI and FMA, leveraging machine learning to automate information extraction and reduce the workload of forensic investigators. We choose OpenSSH as our use case to test different methods for extracting high-level structures. We also test our method on complete physical memory dumps to showcase the effectiveness of the engineered features. Our features range from basic statistical features to advanced graph-based representations using malloc headers and pointer translations. The training and testing are carried out on public datasets, which we compare against already recognized baseline methods. We show that using metadata, we can improve the performance of the algorithm when there is very little training data, and we also quantify how having more data results in better generalization performance. The final contribution is an open dataset of physical memory dumps, totalling more than 1 TB across different memory states, software environments, main memory capacities and operating system versions. Our methods show that having more metadata boosts performance, with all methods obtaining an F1-Score of over 80%. Our research underscores the possibility of using feature engineering and machine learning techniques to bridge the semantic gap.
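To make the feature-engineering idea concrete, here is a minimal sketch, not the paper's pipeline: it computes simple per-page statistics (byte entropy, zero-byte ratio, printable ratio) and builds a toy page-level pointer graph by treating aligned 8-byte words that point back into the dump as candidate pointers. The file name `memory.dump`, the 4 KiB page size, and the pointer heuristic are all illustrative assumptions; the paper's graph features additionally exploit malloc headers and pointer translations.

```python
# Hedged sketch of memory-dump feature engineering; NOT the authors' pipeline.
import math
import struct
from collections import Counter

PAGE = 4096  # assumed x86-64 page size


def page_features(page: bytes):
    """Basic per-page statistics: byte entropy, zero-byte ratio, printable ratio."""
    n = len(page)
    counts = Counter(page)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    zero_ratio = counts.get(0, 0) / n
    printable_ratio = sum(c for b, c in counts.items() if 32 <= b < 127) / n
    return entropy, zero_ratio, printable_ratio


def candidate_pointer_edges(dump: bytes, base: int = 0):
    """Toy pointer graph: add an edge (src_page -> dst_page) whenever an
    aligned 8-byte little-endian word, read as an address, falls back
    inside the dump. A real pipeline would validate against malloc chunk
    headers and translate virtual addresses instead."""
    edges = set()
    hi = base + len(dump)
    for off in range(0, len(dump) - 7, 8):
        (val,) = struct.unpack_from("<Q", dump, off)
        if base <= val < hi:  # naive "looks like a pointer" heuristic
            edges.add((off // PAGE, (val - base) // PAGE))
    return edges


with open("memory.dump", "rb") as f:  # hypothetical dump file
    dump = f.read()

features = [page_features(dump[i:i + PAGE]) for i in range(0, len(dump), PAGE)]
edges = candidate_pointer_edges(dump)
```

Per-page vectors like these could feed a classifier that flags pages holding, say, OpenSSH session structures, while graph statistics (degree, component size) capture the heap connectivity the abstract alludes to.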
Related papers
- Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions [55.19217798774033]
Memory is a fundamental component of AI systems, underpinning large language model (LLM)-based agents.
We introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression.
This survey provides a structured and dynamic perspective on research, benchmark datasets, and tools related to memory in AI.
arXiv Detail & Related papers (2025-05-01T17:31:33Z)
- A Comprehensive Quantification of Inconsistencies in Memory Dumps [13.796554685139855]
We develop a system to track all write operations performed by the OS kernel during a memory acquisition process.
We quantify how different acquisition modes, file systems, and hardware targets influence the frequency of kernel writes during the dump.
arXiv Detail & Related papers (2025-03-19T10:02:54Z)
- Scalability of memorization-based machine unlearning [2.5782420501870296]
Machine unlearning (MUL) focuses on removing the influence of specific subsets of data from pretrained models.
Memorization-based unlearning methods have been developed, demonstrating exceptional performance with respect to unlearning quality.
We tackle the scalability challenges of state-of-the-art memorization-based MUL algorithms using a series of memorization-score proxies.
arXiv Detail & Related papers (2024-10-21T21:18:39Z)
- PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System [21.09681871279162]
Modern Machine Learning (ML) training on large-scale datasets is a time-consuming workload.
It relies on the Stochastic Gradient Descent (SGD) optimization algorithm due to its effectiveness, simplicity, and generalization performance.
However, processor-centric architectures suffer from low performance and high energy consumption when executing ML training workloads.
Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck.
arXiv Detail & Related papers (2024-04-10T17:00:04Z)
- UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory [69.33445217944029]
Parameter-efficient transfer learning (PETL) is an effective strategy for adapting pre-trained models to downstream domains.
Recent PETL work focuses on the more valuable goal of memory efficiency.
We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT).
arXiv Detail & Related papers (2023-08-28T05:38:43Z)
- Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
But acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
- Learning to Rank Graph-based Application Objects on Heterogeneous Memories [0.0]
This paper describes a methodology for identifying and characterizing application objects that have the most influence on the application's performance.
By performing data placement using our predictive model, we can reduce the execution time degradation by 12% (average) and 30% (max) when compared to the baseline's approach.
arXiv Detail & Related papers (2022-11-04T00:20:31Z)
- Memory Wrap: a Data-Efficient and Interpretable Extension to Image Classification Models [9.848884631714451]
Memory Wrap is a plug-and-play extension to any image classification model.
It improves both data-efficiency and model interpretability by adopting a content-attention mechanism between the input and a set of stored training samples (a generic sketch of such a readout follows this list).
We show that Memory Wrap outperforms standard classifiers when it learns from a limited set of data.
arXiv Detail & Related papers (2021-06-01T07:24:19Z)
- Shared Space Transfer Learning for analyzing multi-site fMRI data [83.41324371491774]
Multi-voxel pattern analysis (MVPA) learns predictive models from task-based functional magnetic resonance imaging (fMRI) data.
MVPA works best with a well-designed feature set and an adequate sample size.
Most fMRI datasets are noisy, high-dimensional, expensive to collect, and have small sample sizes.
This paper proposes Shared Space Transfer Learning (SSTL) as a novel transfer learning approach.
arXiv Detail & Related papers (2020-10-24T08:50:26Z)
- Federated Doubly Stochastic Kernel Learning for Vertically Partitioned Data [93.76907759950608]
We propose FDSKL, a federated doubly stochastic kernel learning algorithm for vertically partitioned data.
We show that FDSKL is significantly faster than state-of-the-art federated learning methods when dealing with kernels.
arXiv Detail & Related papers (2020-08-14T05:46:56Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To reduce the training burden introduced by the enlarged dataset, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
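As referenced in the Memory Wrap entry above, the following is a generic sketch of a content-attention readout over a set of stored training-sample encodings. It is not the authors' exact architecture; the shapes, names, and the cosine-similarity scoring are illustrative assumptions.

```python
# Hedged sketch of a content-attention readout over a memory set,
# in the spirit of Memory Wrap; NOT the authors' exact architecture.
import numpy as np


def content_attention(query: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Score each memory row by cosine similarity to the query, softmax
    the scores, and return the attention-weighted combination of rows."""
    q = query / (np.linalg.norm(query) + 1e-8)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    scores = m @ q                      # cosine similarity per memory row
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over the memory set
    return weights @ memory             # weighted readout vector


rng = np.random.default_rng(0)
encoding = rng.normal(size=16)          # encoder output for one input
memory_set = rng.normal(size=(32, 16))  # encodings of stored training samples
readout = content_attention(encoding, memory_set)
# A classifier head would then consume, e.g., np.concatenate([encoding, readout]).
```

The attention weights double as an interpretability signal, since they indicate which stored samples drove the prediction.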
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and accepts no responsibility for any consequences of its use.