Weaponizing Unicodes with Deep Learning -- Identifying Homoglyphs with
Weakly Labeled Data
- URL: http://arxiv.org/abs/2010.04382v4
- Date: Tue, 22 Dec 2020 18:11:46 GMT
- Title: Weaponizing Unicodes with Deep Learning -- Identifying Homoglyphs with
Weakly Labeled Data
- Authors: Perry Deng, Cooper Linsky, Matthew Wright
- Abstract summary: Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors.
We investigate a deep-learning model using embedding learning, transfer learning, and augmentation to identify potential homoglyphs.
We also use our model to predict over 8,000 previously unknown homoglyphs, and find good early indications that many may be true positives.
- Score: 11.434810426156877
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visually similar characters, or homoglyphs, can be used to perform social
engineering attacks or to evade spam and plagiarism detectors. It is thus
important to understand the capabilities of an attacker to identify homoglyphs
-- particularly ones that have not been previously spotted -- and leverage them
in attacks. We investigate a deep-learning model using embedding learning,
transfer learning, and augmentation to determine the visual similarity of
characters and thereby identify potential homoglyphs. Our approach uniquely
takes advantage of weak labels that arise from the fact that most characters
are not homoglyphs. Our model drastically outperforms the Normalized
Compression Distance approach on pairwise homoglyph identification, for which
we achieve an average precision of 0.97. We also present the first attempt at
clustering homoglyphs into sets of equivalence classes, which is more efficient
than pairwise information for security practitioners to quickly look up
homoglyphs or to normalize confusable string encodings. To measure clustering
performance, we propose a metric (mBIOU) building on the classic
Intersection-Over-Union (IOU) metric. Our clustering method achieves 0.592
mBIOU, compared to 0.430 for the naive baseline. We also use our model to
predict over 8,000 previously unknown homoglyphs, and find good early
indications that many of these may be true positives. Source code and list of
predicted homoglyphs are uploaded to GitHub:
https://github.com/PerryXDeng/weaponizing_unicode
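The abstract names two reference points: the Normalized Compression Distance (NCD) baseline and the classic Intersection-Over-Union (IOU) metric that mBIOU builds on. A minimal sketch of both follows, not the authors' code. It assumes zlib as the compressor and raw byte strings as input (for glyphs, these might be rasterized bitmaps), and the function names `ncd` and `iou` are illustrative. The mBIOU metric itself is defined in the paper and is not reproduced here.

```python
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, with zlib as an assumed compressor.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length in bytes.
    """
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def iou(a: set, b: set) -> float:
    """Classic Intersection-Over-Union (Jaccard index) of two sets."""
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)
```

Lower NCD values suggest more similar inputs. For clustering, `iou` could compare a predicted homoglyph set against a reference set, which is the kind of set overlap that a cluster-level metric such as mBIOU aggregates.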
Related papers
- Provably Secure Disambiguating Neural Linguistic Steganography [66.30965740387047]
The segmentation ambiguity problem, which arises when using language models based on subwords, leads to occasional decoding failures.
We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem.
SyncPool does not change the size of the candidate pool or the distribution of tokens and thus is applicable to provably secure language steganography methods.
arXiv Detail & Related papers (2024-03-26T09:25:57Z)
- Generation is better than Modification: Combating High Class Homophily Variance in Graph Anomaly Detection [51.11833609431406]
Homophily distribution differences between different classes are significantly greater than those in homophilic and heterophilic graphs.
We introduce a new metric called Class Homophily Variance, which quantitatively describes this phenomenon.
To mitigate its impact, we propose a novel GNN model named Homophily Edge Generation Graph Neural Network (HedGe).
arXiv Detail & Related papers (2024-03-15T14:26:53Z)
- Pairwise Similarity Learning is SimPLE [104.14303849615496]
We focus on a general yet important learning problem, pairwise similarity learning (PSL).
PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification.
We propose a surprisingly simple proxy-free method, called SimPLE, which requires neither feature/proxy normalization nor angular margin.
arXiv Detail & Related papers (2023-10-13T23:56:47Z)
- GlyphNet: Homoglyph domains dataset and detection using attention-based
Convolutional Neural Networks [1.0312968200748118]
Homoglyph attacks create illegitimate domains that are hard to distinguish from legit ones.
Existing approaches use simple, string-based comparison techniques applied in primary language-based tasks.
We show that our model can reach state-of-the-art accuracy in detecting homoglyph attacks with a 0.93 AUC on our dataset.
arXiv Detail & Related papers (2023-06-17T17:16:53Z)
- Non-contrastive representation learning for intervals from well logs [58.70164460091879]
The representation learning problem in the oil & gas industry aims to construct a model that provides a representation based on logging data for a well interval.
One of the possible approaches is self-supervised learning (SSL).
We are the first to introduce non-contrastive SSL for well-logging data.
arXiv Detail & Related papers (2022-09-28T13:27:10Z)
- Leveraging Dependency Grammar for Fine-Grained Offensive Language
Detection using Graph Convolutional Networks [0.5457150493905063]
We address the problem of offensive language detection on Twitter.
We propose a novel approach called SyLSTM, which integrates syntactic features in the form of the dependency parse tree of a sentence.
Results show that the proposed approach significantly outperforms the state-of-the-art BERT model with orders of magnitude fewer parameters.
arXiv Detail & Related papers (2022-05-26T05:27:50Z)
- Towards Good Practices for Efficiently Annotating Large-Scale Image
Classification Datasets [90.61266099147053]
We investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images.
We propose modifications and best practices aimed at minimizing human labeling effort.
Simulated experiments on a 125k image subset of the ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average.
arXiv Detail & Related papers (2021-04-26T16:29:32Z)
- New Benchmarks for Learning on Non-Homophilous Graphs [20.082182515715182]
We present a series of improved graph datasets with node label relationships that do not satisfy the homophily principle.
We also introduce a new measure of the presence or absence of homophily that is better suited than existing measures in different regimes.
arXiv Detail & Related papers (2021-04-03T13:45:06Z)
- PhishGAN: Data Augmentation and Identification of Homoglpyh Attacks [0.0]
Homoglyph attacks are a common technique used by hackers to conduct phishing. Domain names or links that are visually similar to actual ones are created via punycode to obfuscate the attack.
Here, we show how a conditional Generative Adversarial Network (GAN), PhishGAN, can be used to generate images of homoglyphs.
arXiv Detail & Related papers (2020-06-24T13:59:09Z)
- FixMatch: Simplifying Semi-Supervised Learning with Consistency and
Confidence [93.91751021370638]
Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance.
In this paper, we demonstrate the power of a simple combination of two common SSL methods: consistency regularization and pseudo-labeling.
Our algorithm, FixMatch, first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images.
arXiv Detail & Related papers (2020-01-21T18:32:27Z)