Related papers: Phishing Website Detection through Multi-Model Analysis of HTML Content

Phishing Website Detection through Multi-Model Analysis of HTML Content

URL: http://arxiv.org/abs/2401.04820v3
Date: Wed, 10 Jul 2024 10:47:07 GMT
Title: Phishing Website Detection through Multi-Model Analysis of HTML Content
Authors: Furkan Çolhak, Mert İlhan Ecevit, Bilal Emir Uçar, Reiner Creutzburg, Hasan Dağ,
Abstract summary: This study addresses the pressing issue of phishing by introducing an advanced detection model that meticulously focuses on HTML content. Our proposed approach integrates a specialized Multi-Layer Perceptron (MLP) model for structured tabular data and two pretrained Natural Language Processing (NLP) models for analyzing textual features. The fusion of two NLP and one model,termed MultiText-LP, achieves impressive results, yielding a 96.80 F1 score and a 97.18 accuracy score on our research dataset.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The way we communicate and work has changed significantly with the rise of the Internet. While it has opened up new opportunities, it has also brought about an increase in cyber threats. One common and serious threat is phishing, where cybercriminals employ deceptive methods to steal sensitive information.This study addresses the pressing issue of phishing by introducing an advanced detection model that meticulously focuses on HTML content. Our proposed approach integrates a specialized Multi-Layer Perceptron (MLP) model for structured tabular data and two pretrained Natural Language Processing (NLP) models for analyzing textual features such as page titles and content. The embeddings from these models are harmoniously combined through a novel fusion process. The resulting fused embeddings are then input into a linear classifier. Recognizing the scarcity of recent datasets for comprehensive phishing research, our contribution extends to the creation of an up-to-date dataset, which we openly share with the community. The dataset is meticulously curated to reflect real-life phishing conditions, ensuring relevance and applicability. The research findings highlight the effectiveness of the proposed approach, with the CANINE demonstrating superior performance in analyzing page titles and the RoBERTa excelling in evaluating page content. The fusion of two NLP and one MLP model,termed MultiText-LP, achieves impressive results, yielding a 96.80 F1 score and a 97.18 accuracy score on our research dataset. Furthermore, our approach outperforms existing methods on the CatchPhish HTML dataset, showcasing its efficacies.

Related papers

MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection [0.0]
This paper presents MeAJOR, a novel, multi-source phishing email dataset.<n>It integrates 135894 samples representing a broad number of phishing tactics and legitimate emails.<n>By integrating broad features from multiple categories, our dataset provides a reusable and consistent resource.
arXiv Detail & Related papers (2025-07-23T22:57:08Z)
SMOTExT: SMOTE meets Large Language Models [19.394116388173885]
We propose a novel technique, SMOTExT, that adapts the idea of Synthetic Minority Over-sampling (SMOTE) to textual data.<n>Our method generates new synthetic examples by interpolating between BERT-based embeddings of two existing examples.<n>In early experiments, training models solely on generated data achieved comparable performance to models trained on the original dataset.
arXiv Detail & Related papers (2025-05-19T17:57:36Z)
Towards Effective Identification of Attack Techniques in Cyber Threat Intelligence Reports using Large Language Models [5.304267859042463]
This work evaluates the performance of Cyber Threat Intelligence (CTI) extraction methods in identifying attack techniques from threat reports available on the web.<n>We analyse four configurations utilising state-of-the-art tools, including the Threat Report ATT&CK Mapper (TRAM) and open-source Large Language Models (LLMs) such as Llama2.<n>Our findings reveal significant challenges, including class imbalance, overfitting, and domain-specific complexity, which impede accurate technique extraction.
arXiv Detail & Related papers (2025-05-06T03:43:12Z)
TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models [0.0]
This study explores a hybrid framework combining transformer-based models to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets. This research highlights its applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking.
arXiv Detail & Related papers (2025-04-14T05:44:11Z)
A Hybrid Attention Framework for Fake News Detection with Large Language Models [0.0]
We propose a novel framework to identify and classify fake news by integrating textual statistical features and deep semantic features. Our approach utilizes the contextual understanding capability of the large language model for text analysis. Our model significantly outperforms existing methods, with a 1.5% improvement in F1 score.
arXiv Detail & Related papers (2025-01-21T08:26:20Z)
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
NLP-ADBench: NLP Anomaly Detection Benchmark [9.445800367013744]
We introduce NLP-ADBench, the most comprehensive benchmark for NLP anomaly detection. No single model excels across all datasets, highlighting the need for automated model selection. Two-step methods leveraging transformer-based embeddings consistently outperform specialized end-to-end approaches.
arXiv Detail & Related papers (2024-12-06T05:30:41Z)
Enhancing Authorship Attribution through Embedding Fusion: A Novel Approach with Masked and Encoder-Decoder Language Models [0.0]
We propose a novel framework with textual embeddings from Pre-trained Language Models to distinguish AI-generated and human-authored text. Our approach utilizes Embedding Fusion to integrate semantic information from multiple Language Models, harnessing their complementary strengths to enhance performance.
arXiv Detail & Related papers (2024-11-01T07:18:27Z)
Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection [16.21235742118949]
We propose a novel approach that repurposes a well-trained Vision-Language Models (VLMs) for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via data perturbations, our method can reprogram a pretrained VLM model. Our superior performances are at less cost of trainable parameters, making it a promising approach for real-world applications.
arXiv Detail & Related papers (2024-09-04T12:46:30Z)
SecureNet: A Comparative Study of DeBERTa and Large Language Models for Phishing Detection [0.0]
Phishing is a major threat to organizations by using social engineering to trick users into revealing sensitive information. In this paper, we investigate whether the remarkable performance of Large Language Models (LLMs) can be leveraged for particular task like text classification. We demonstrate how LLMs can generate convincing phishing emails, making it harder to spot scams.
arXiv Detail & Related papers (2024-06-10T13:13:39Z)
A Sophisticated Framework for the Accurate Detection of Phishing Websites [0.0]
Phishing is an increasingly sophisticated form of cyberattack that is inflicting huge financial damage to corporations throughout the globe. This paper proposes a comprehensive methodology for detecting phishing websites. A combination of feature selection, greedy algorithm, cross-validation, and deep learning methods have been utilized to construct a sophisticated stacking ensemble.
arXiv Detail & Related papers (2024-03-13T14:26:25Z)
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data. In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction plays as a core function module in personalized online services. Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality. Pretrained Language Models(PLMs) has given rise to another paradigm, which takes as inputs the sentences of textual modality. We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models(FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z)
Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data. We propose a general framework to solve the above two challenges simultaneously. We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z)
Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods. We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments. The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks. Recent studies show that such BERT-based models are vulnerable facing the threats of textual adversarial attacks. We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.