Bridging the Gap in Phishing Detection: A Comprehensive Phishing Dataset Collector
- URL: http://arxiv.org/abs/2509.09592v1
- Date: Thu, 11 Sep 2025 16:30:12 GMT
- Title: Bridging the Gap in Phishing Detection: A Comprehensive Phishing Dataset Collector
- Authors: Aditya Kulkarni, Shahil Manishbhai Patel, Shivam Pradip Tirmare, Vivek Balachandran, Tamal Das
- Abstract summary: This paper introduces a resource collection tool designed to gather various resources associated with a URL, such as CSS, Javascript, favicons, webpage images, and screenshots. We share a sample dataset generated using our tool comprising 4,056 legitimate and 5,666 phishing URLs along with their associated resources.
- Score: 0.030786914102688596
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To combat phishing attacks -- aimed at luring web users to divulge their sensitive information -- various phishing detection approaches have been proposed. As attackers focus on devising new tactics to bypass existing detection solutions, researchers have adapted by integrating machine learning and deep learning into phishing detection. Phishing dataset collection is vital to developing effective phishing detection approaches, which highly depend on the diversity of the gathered datasets. The lack of diversity in the dataset results in a biased model. Since phishing websites are often short-lived, collecting them is also a challenge. Consequently, very few phishing webpage dataset repositories exist to date. No single repository comprehensively consolidates all phishing elements corresponding to a phishing webpage, namely, URL, webpage source code, screenshot, and related webpage resources. This paper introduces a resource collection tool designed to gather various resources associated with a URL, such as CSS, Javascript, favicons, webpage images, and screenshots. Our tool leverages PhishTank as the primary source for obtaining active phishing URLs. Our tool fetches several additional webpage resources compared to the PyWebCopy Python library, which provides webpage content for a given URL. Additionally, we share a sample dataset generated using our tool comprising 4,056 legitimate and 5,666 phishing URLs along with their associated resources. We also remark on the top correlated phishing features with their associated class label found in our dataset. Our tool offers a comprehensive resource set that can aid researchers in developing effective phishing detection approaches.
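To make the idea of gathering the resources associated with a URL concrete, the following is a minimal sketch, not the authors' tool. It only lists the CSS, JavaScript, favicon, and image URLs that a page's HTML references, using the Python standard library. The names `ResourceLinkParser` and `list_resources` are hypothetical, and downloading the files, capturing screenshots, and querying PhishTank are out of scope here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceLinkParser(HTMLParser):
    """Hypothetical helper: collects resource URLs referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = {"css": [], "javascript": [], "favicon": [], "image": []}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "link" and attr.get("href"):
            rel = (attr.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self._add("css", attr["href"])
            elif "icon" in rel:  # also matches rel="shortcut icon"
                self._add("favicon", attr["href"])
        elif tag == "script" and attr.get("src"):
            self._add("javascript", attr["src"])
        elif tag == "img" and attr.get("src"):
            self._add("image", attr["src"])

    def _add(self, kind, url):
        # Resolve relative references against the page URL.
        self.resources[kind].append(urljoin(self.base_url, url))


def list_resources(html, base_url):
    """Return a dict mapping resource kind to a list of absolute URLs."""
    parser = ResourceLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources
```

A collector built along these lines would still need to fetch each listed URL quickly, since the abstract notes that phishing pages are often short-lived.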
Related papers
- CIC-Trap4Phish: A Unified Multi-Format Dataset for Phishing and Quishing Attachment Detection [35.21543593148398]
Phishing attacks represent one of the primary attack methods used by cyber attackers. The CIC-Trap4Phish dataset contains both malicious and benign samples across five categories commonly used in phishing campaigns.
arXiv Detail & Related papers (2026-02-09T18:57:00Z)
- WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents [45.87204751555924]
Prompt injection attacks manipulate webpage content to cause web agents to execute attacker-specified tasks instead of the user's intended ones. Existing methods for detecting and localizing such attacks achieve limited effectiveness. We propose WebSentinel, a two-step approach for detecting and localizing prompt injection attacks in webpages.
arXiv Detail & Related papers (2026-02-03T17:55:04Z)
- Characterizing Phishing Pages by JavaScript Capabilities [77.64740286751834]
This paper aims to aid researchers and analysts by automatically differentiating groups of phishing pages based on the underlying kit. For kit detection, our system has an accuracy of 97% on a ground-truth dataset of 548 kit families deployed across 4,562 phishing URLs. We find that UI interactivity and basic fingerprinting are universal techniques, present in 90% and 80% of the clusters.
arXiv Detail & Related papers (2025-09-16T15:39:23Z)
- Phish-Blitz: Advancing Phishing Detection with Comprehensive Webpage Resource Collection and Visual Integrity Preservation [0.03262230127283452]
We introduce Phish-Blitz, a tool that downloads phishing and legitimate webpages along with their associated resources, such as screenshots. Unlike existing tools, Phish-Blitz captures live webpage screenshots and updates resource file paths to maintain the original visual integrity of the webpage. We provide a dataset containing 8,809 legitimate and 5,000 phishing webpages, including all associated resources.
arXiv Detail & Related papers (2025-09-10T08:13:49Z)
- Can Features for Phishing URL Detection Be Trusted Across Diverse Datasets? A Case Study with Explainable AI [0.0]
Phishing has been a prevalent cyber threat that manipulates users into revealing sensitive private information through deceptive tactics.
Proactive detection of phishing URLs (or websites) has been established as a widely accepted defense approach.
We analyze two publicly available phishing URL datasets, where each dataset has its own set of unique and overlapping features related to URL string and website contents.
arXiv Detail & Related papers (2024-11-14T21:07:52Z)
- From ML to LLM: Evaluating the Robustness of Phishing Webpage Detection Models against Adversarial Attacks [0.8050163120218178]
Phishing attacks attempt to deceive users into stealing sensitive information, posing a significant cybersecurity threat. We develop PhishOracle, a tool that generates adversarial phishing webpages by embedding diverse phishing features into legitimate webpages. Our findings highlight the vulnerability of phishing detection models to adversarial attacks, emphasizing the need for more robust detection approaches.
arXiv Detail & Related papers (2024-07-29T18:21:34Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- KnowPhish: Large Language Models Meet Multimodal Knowledge Graphs for Enhancing Reference-Based Phishing Detection [36.014171641453615]
We propose an automated knowledge collection pipeline, containing 20k brands with rich information about each brand.
KnowPhish can be used to boost the performance of existing reference-based phishing detectors.
Our resulting multimodal phishing detection approach, KnowPhish Detector, can detect phishing webpages with or without logos.
arXiv Detail & Related papers (2024-03-04T17:38:32Z)
- Prompted Contextual Vectors for Spear-Phishing Detection [41.26408609344205]
Spear-phishing attacks present a significant security challenge. We propose a detection approach based on a novel document vectorization method. Our method achieves a 91% F1 score in identifying LLM-generated spear-phishing emails.
arXiv Detail & Related papers (2024-02-13T09:12:55Z)
- Mitigating Bias in Machine Learning Models for Phishing Webpage Detection [0.8050163120218178]
Phishing, a well-known cyberattack, revolves around the creation of phishing webpages and the dissemination of corresponding URLs.
Various techniques are available for preemptively categorizing zero-day phishing URLs by distilling unique attributes and constructing predictive models.
This proposal delves into persistent challenges within phishing detection solutions, particularly concentrated on the preliminary phase of assembling comprehensive datasets.
We propose a potential solution in the form of a tool engineered to alleviate bias in ML models.
arXiv Detail & Related papers (2024-01-16T13:45:54Z)
- Deep convolutional forest: a dynamic deep ensemble approach for spam detection in text [219.15486286590016]
This paper introduces a dynamic deep ensemble model for spam detection that adjusts its complexity and extracts features automatically.
As a result, the model achieved high precision, recall, f1-score and accuracy of 98.38%.
arXiv Detail & Related papers (2021-10-10T17:19:37Z)
- Phishing and Spear Phishing: examples in Cyber Espionage and techniques to protect against them [91.3755431537592]
Phishing attacks have become the most used technique in online scams, initiating more than 91% of cyberattacks from 2012 onwards.
This study reviews how Phishing and Spear Phishing attacks are carried out by the phishers, through 5 steps which magnify the outcome.
arXiv Detail & Related papers (2020-05-31T18:10:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.