ForensicsData: A Digital Forensics Dataset for Large Language Models
- URL: http://arxiv.org/abs/2509.05331v1
- Date: Sun, 31 Aug 2025 19:58:24 GMT
- Title: ForensicsData: A Digital Forensics Dataset for Large Language Models
- Authors: Youssef Chakir, Iyad Lahsen-Cherif
- Abstract summary: ForensicsData is an extensive Question-Context-Answer (Q-C-A) dataset sourced from actual malware analysis reports. A unique workflow was used to create the dataset, which extracts structured data. Gemini 2 Flash demonstrated the best performance in aligning generated content with forensic terminology.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing complexity of cyber incidents presents significant challenges for digital forensic investigators, especially in evidence collection and analysis. Public resources remain limited because of ethical, legal, and privacy concerns, even though realistic datasets are necessary to support research and tool development. To address this gap, we introduce ForensicsData, an extensive Question-Context-Answer (Q-C-A) dataset sourced from actual malware analysis reports. It consists of more than 5,000 Q-C-A triplets. A unique workflow was used to create the dataset, which extracts structured data, uses large language models (LLMs) to transform it into Q-C-A format, and then applies a specialized evaluation process to confirm its quality. Among the models evaluated, Gemini 2 Flash demonstrated the best performance in aligning generated content with forensic terminology. ForensicsData aims to advance digital forensics by enabling reproducible experiments and fostering collaboration within the research community.
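The workflow described in the abstract (extract structured data, transform it into Q-C-A triplets with an LLM, then filter for quality) can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: `extract_sections`, `sections_to_triplets`, and `quality_filter` are hypothetical names, and a deterministic stand-in replaces the LLM call.

```python
from dataclasses import dataclass


@dataclass
class QCATriplet:
    question: str
    context: str  # excerpt from the malware analysis report
    answer: str


def extract_sections(report: str) -> dict[str, str]:
    """Toy structured-data extraction: split a report into 'Header: body' fields."""
    sections = {}
    for line in report.splitlines():
        if ":" in line:
            key, _, body = line.partition(":")
            sections[key.strip()] = body.strip()
    return sections


def sections_to_triplets(sections: dict[str, str]) -> list[QCATriplet]:
    """Stand-in for the LLM transformation step: one Q-C-A triplet per section."""
    return [
        QCATriplet(
            question=f"What does the report state about {key.lower()}?",
            context=f"{key}: {body}",
            answer=body,
        )
        for key, body in sections.items()
    ]


def quality_filter(triplets: list[QCATriplet]) -> list[QCATriplet]:
    """Stand-in for the specialized evaluation step: drop unanswerable triplets."""
    return [t for t in triplets if t.answer]


report = "Family: Emotet\nPersistence: Registry run key\nC2:"
dataset = quality_filter(sections_to_triplets(extract_sections(report)))
```

Here the empty `C2` section is dropped by the quality step, leaving two triplets; in the real pipeline each stage is far richer, but the extract-transform-validate shape is the same.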
Related papers
- Evaluating the Reliability of Digital Forensic Evidence Discovered by Large Language Model: A Case Study [1.7102309907119588]
This paper proposes a structured framework that automates forensic artifact extraction, refines data through large language model (LLM) analysis, and validates results using a Digital Forensic Knowledge Graph (DFKG). The framework is evaluated on a 13 GB forensic image dataset containing 61 applications, 2,864 databases, and 5,870 tables. A case study shows the framework's effectiveness, achieving over 95 percent accuracy in artifact extraction, strong chain-of-custody adherence, and robust contextual consistency in forensic relationships.
arXiv Detail & Related papers (2026-02-22T18:20:49Z) - AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research [81.04845910798387]
Generating natural language explanations for threat detections remains an open problem in cybersecurity research. We present AutoMalDesc, an automated static analysis summarization framework that operates independently at scale. We publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) datasets, along with our methodology and evaluation framework.
arXiv Detail & Related papers (2025-11-17T13:05:25Z) - DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response [0.0]
Large Language Models (LLMs) offer new opportunities in Digital Forensics and Incident Response (DFIR) tasks such as log analysis and memory forensics, but their susceptibility to errors and hallucinations raises concerns in high-stakes contexts. We present DFIR-Metric, a benchmark to evaluate LLMs across both theoretical and practical DFIR domains.
arXiv Detail & Related papers (2025-05-26T13:35:37Z) - Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z) - Decoding MIE: A Novel Dataset Approach Using Topic Extraction and Affiliation Parsing [0.0]
This study introduces a novel dataset derived from the Medical Informatics Europe (MIE) Conference proceedings.
We extracted and processed metadata and abstract from 4,606 articles published in the "Studies in Health Technology and Informatics" journal series.
arXiv Detail & Related papers (2024-10-06T19:34:23Z) - GenDFIR: Advancing Cyber Incident Timeline Analysis Through Retrieval Augmented Generation and Large Language Models [0.08192907805418582]
Cyber timeline analysis is crucial in Digital Forensics and Incident Response (DFIR). Traditional methods rely on structured artefacts, such as logs and metadata, for evidence identification and feature extraction. This paper introduces GenDFIR, a framework leveraging large language models (LLMs), specifically Llama 3.1 8B in zero-shot mode, integrated with a Retrieval-Augmented Generation (RAG) agent.
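The retrieval half of a RAG pipeline like the one GenDFIR describes can be sketched in a few lines: score timeline events against an investigator's query and keep the top-k as LLM context. This is an illustrative sketch only; naive token overlap stands in for real embedding similarity, and the event strings are invented.

```python
def token_overlap(a: str, b: str) -> int:
    """Crude relevance score: number of shared lowercase tokens."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def retrieve(events: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k timeline events most relevant to the query.

    In a real RAG agent this ranking would use vector embeddings; the
    top-k events would then be injected into the LLM prompt as context.
    """
    ranked = sorted(events, key=lambda e: token_overlap(e, query), reverse=True)
    return ranked[:k]


events = [
    "2024-01-02 failed ssh login from 10.0.0.5",
    "2024-01-02 scheduled task created by user admin",
    "2024-01-03 outbound connection to known C2 domain",
]
context = retrieve(events, "ssh login failures", k=1)
```

Swapping the overlap score for embedding similarity and prepending `context` to the model prompt yields the basic retrieve-then-generate loop that such frameworks build on.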
arXiv Detail & Related papers (2024-09-04T09:46:33Z) - CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA), as well as summarization. Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while exceeding models trained on human-curated summarization data by 46 preference points.
arXiv Detail & Related papers (2024-09-03T17:54:40Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - Integration of Domain Expert-Centric Ontology Design into the CRISP-DM for Cyber-Physical Production Systems [45.05372822216111]
Methods from Machine Learning (ML) and Data Mining (DM) have proven to be promising in extracting complex and hidden patterns from the data collected.
However, such data-driven projects, usually performed with the Cross-Industry Standard Process for Data Mining (CRISP-DM), often fail due to the disproportionate amount of time needed for understanding and preparing the data.
This contribution presents an integrated approach so that data scientists can more quickly and reliably gain insights into CPPS challenges.
arXiv Detail & Related papers (2023-07-21T15:04:00Z) - Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z) - Automated Artefact Relevancy Determination from Artefact Metadata and Associated Timeline Events [7.219077740523683]
Case-hindering, multi-year digital forensic evidence backlogs have become commonplace in law enforcement agencies throughout the world.
This is due to an ever-growing number of cases requiring digital forensic investigation coupled with the growing volume of data to be processed per case.
Leveraging previously processed digital forensic cases and their component artefact relevancy classifications can facilitate an opportunity for training automated artificial intelligence based evidence processing systems.
arXiv Detail & Related papers (2020-12-02T14:14:26Z) - Visilant: Visual Support for the Exploration and Analytical Process Tracking in Criminal Investigations [1.8594711725515676]
Visilant is a web-based tool for the exploration and analysis of criminal data guided by the proposed design.
The tool was evaluated by senior criminology experts within two sessions and their feedback is summarized in the paper.
arXiv Detail & Related papers (2020-09-21T09:24:20Z)