CIC-Trap4Phish: A Unified Multi-Format Dataset for Phishing and Quishing Attachment Detection
- URL: http://arxiv.org/abs/2602.09015v3
- Date: Wed, 11 Feb 2026 14:56:48 GMT
- Title: CIC-Trap4Phish: A Unified Multi-Format Dataset for Phishing and Quishing Attachment Detection
- Authors: Fatemeh Nejati, Mahdi Rabbani, Morteza Eskandarian, Mansur Mirani, Gunjan Piya, Igor Opushnyev, Ali A. Ghorbani, Sajjad Dadkhah,
- Abstract summary: Phishing attacks represent one of the primary attack methods used by cyber attackers.<n> CIC-Trap4Phish dataset contains both malicious and benign samples across five categories commonly used in phishing campaigns.
- Score: 35.21543593148398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Phishing attacks represents one of the primary attack methods which is used by cyber attackers. In many cases, attackers use deceptive emails along with malicious attachments to trick users into giving away sensitive information or installing malware while compromising entire systems. The flexibility of malicious email attachments makes them stand out as a preferred vector for attackers as they can embed harmful content such as malware or malicious URLs inside standard document formats. Although phishing email defenses have improved a lot, attackers continue to abuse attachments, enabling malicious content to bypass security measures. Moreover, another challenge that researches face in training advance models, is lack of an unified and comprehensive dataset that covers the most prevalent data types. To address this gap, we generated CIC-Trap4Phish, a multi-format dataset containing both malicious and benign samples across five categories commonly used in phishing campaigns: Microsoft Word documents, Excel spreadsheets, PDF files, HTML pages, and QR code images. For the first four file types, a set of execution-free static feature pipeline was proposed, designed to capture structural, lexical, and metadata-based indicators without the need to open or execute files. Feature selection was performed using a combination of SHAP analysis and feature importance, yielding compact, discriminative feature subsets for each file type. The selected features were evaluated by using lightweight machine learning models, including Random Forest, XGBoost, and Decision Tree. All models demonstrate high detection accuracy across formats. For QR code-based phishing (quishing), two complementary methods were implemented: image-based detection by employing Convolutional Neural Networks (CNNs) and lexical analysis of decoded URLs using recent lightweight language models.
Related papers
- BlackCATT: Black-box Collusion Aware Traitor Tracing in Federated Learning [51.251962154210474]
We present a general collusion-resistant embedding method for black-box traitor tracing in Federated Learning: BlackCATT.<n> Experimental results confirm the efficacy of the proposed scheme across different architectures and datasets.<n>For models that would otherwise suffer from update incompatibility on the main task, our proposed BlackCATT+FR incorporates functional regularization.
arXiv Detail & Related papers (2026-02-12T16:26:57Z) - Characterizing Phishing Pages by JavaScript Capabilities [77.64740286751834]
This paper aims to aid researchers and analysts by automatically differentiating groups of phishing pages based on the underlying kit.<n>For kit detection, our system has an accuracy of 97% on a ground-truth dataset of 548 kit families deployed across 4,562 phishing URLs.<n>We find that UI interactivity and basic fingerprinting are universal techniques, present in 90% and 80% of the clusters.
arXiv Detail & Related papers (2025-09-16T15:39:23Z) - Bridging the Gap in Phishing Detection: A Comprehensive Phishing Dataset Collector [0.030786914102688596]
This paper introduces a resource collection tool designed to gather various resources associated with a URL, such as CSS, Javascript, favicons, webpage images, and screenshots.<n>We share a sample dataset generated using our tool comprising 4,056 legitimate and 5,666 phishing URLs along with their associated resources.
arXiv Detail & Related papers (2025-09-11T16:30:12Z) - MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection [1.554831836850549]
This paper presents MeAJOR, a novel, multi-source phishing email dataset.<n>It integrates 135894 samples representing a broad number of phishing tactics and legitimate emails.<n>By integrating broad features from multiple categories, our dataset provides a reusable and consistent resource.
arXiv Detail & Related papers (2025-07-23T22:57:08Z) - Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning [58.16354555208417]
PAD and FFD are proposed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively.<n>The lack of a Unified Face Attack Detection model to simultaneously handle attacks in these two categories is mainly attributed to two factors.<n>We present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework that adaptively explores multiple classification criteria from different semantic spaces.
arXiv Detail & Related papers (2025-05-19T16:35:45Z) - PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection [26.106113544525545]
Phishing attacks are a major threat to online security, exploiting user vulnerabilities to steal sensitive information.<n>Various methods have been developed to counteract phishing, each with varying levels of accuracy, but they also face notable limitations.<n>In this study, we introduce PhishAgent, a multimodal agent that combines a wide range of tools, integrating both online and offline knowledge bases with Multimodal Large Language Models (MLLMs)<n>This combination leads to broader brand coverage, which enhances brand recognition and recall.
arXiv Detail & Related papers (2024-08-20T11:14:21Z) - From ML to LLM: Evaluating the Robustness of Phishing Webpage Detection Models against Adversarial Attacks [0.8050163120218178]
Phishing attacks attempt to deceive users into stealing sensitive information, posing a significant cybersecurity threat.<n>We develop PhishOracle, a tool that generates adversarial phishing webpages by embedding diverse phishing features into legitimate webpages.<n>Our findings highlight the vulnerability of phishing detection models to adversarial attacks, emphasizing the need for more robust detection approaches.
arXiv Detail & Related papers (2024-07-29T18:21:34Z) - Mitigating Bias in Machine Learning Models for Phishing Webpage Detection [0.8050163120218178]
Phishing, a well-known cyberattack, revolves around the creation of phishing webpages and the dissemination of corresponding URLs.
Various techniques are available for preemptively categorizing zero-day phishing URLs by distilling unique attributes and constructing predictive models.
This proposal delves into persistent challenges within phishing detection solutions, particularly concentrated on the preliminary phase of assembling comprehensive datasets.
We propose a potential solution in the form of a tool engineered to alleviate bias in ML models.
arXiv Detail & Related papers (2024-01-16T13:45:54Z) - An Adversarial Attack Analysis on Malicious Advertisement URL Detection
Framework [22.259444589459513]
Malicious advertisement URLs pose a security risk since they are the source of cyber-attacks.
Existing malicious URL detection techniques are limited and to handle unseen features as well as generalize to test data.
In this study, we extract a novel set of lexical and web-scrapped features and employ machine learning technique to set up system for fraudulent advertisement URLs detection.
arXiv Detail & Related papers (2022-04-27T20:06:22Z) - Adversarial EXEmples: A Survey and Experimental Evaluation of Practical
Attacks on Machine Learning for Windows Malware Detection [67.53296659361598]
adversarial EXEmples can bypass machine learning-based detection by perturbing relatively few input bytes.
We develop a unifying framework that does not only encompass and generalize previous attacks against machine-learning models, but also includes three novel attacks.
These attacks, named Full DOS, Extend and Shift, inject the adversarial payload by respectively manipulating the DOS header, extending it, and shifting the content of the first section.
arXiv Detail & Related papers (2020-08-17T07:16:57Z) - Detecting malicious PDF using CNN [46.86114958340962]
Malicious PDF files represent one of the biggest threats to computer security.
We propose a novel algorithm that uses an ensemble of Convolutional Neural Network (CNN) on the byte level of the file.
We show, using a data set of 90000 files downloadable online, that our approach maintains a high detection rate (94%) of PDF malware.
arXiv Detail & Related papers (2020-07-24T18:27:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.