Related papers: Weaponizing Unicodes with Deep Learning -- Identifying Homoglyphs with Weakly Labeled Data

URL: http://arxiv.org/abs/2010.04382v4
Date: Tue, 22 Dec 2020 18:11:46 GMT
Title: Weaponizing Unicodes with Deep Learning -- Identifying Homoglyphs with Weakly Labeled Data
Authors: Perry Deng, Cooper Linsky, Matthew Wright
Abstract summary: Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. We investigate a deep-learning model that uses embedding learning, transfer learning, and augmentation to identify potential homoglyphs. We also use our model to predict over 8,000 previously unknown homoglyphs, and find good early indications that many may be true positives.
Score: 11.434810426156877
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. It is thus important to understand the capabilities of an attacker to identify homoglyphs -- particularly ones that have not been previously spotted -- and leverage them in attacks. We investigate a deep-learning model using embedding learning, transfer learning, and augmentation to determine the visual similarity of characters and thereby identify potential homoglyphs. Our approach uniquely takes advantage of weak labels that arise from the fact that most characters are not homoglyphs. Our model drastically outperforms the Normalized Compression Distance approach on pairwise homoglyph identification, for which we achieve an average precision of 0.97. We also present the first attempt at clustering homoglyphs into sets of equivalence classes, which is more efficient than pairwise information for security practitioners to quickly look up homoglyphs or to normalize confusable string encodings. To measure clustering performance, we propose a metric (mBIOU) building on the classic Intersection-Over-Union (IOU) metric. Our clustering method achieves 0.592 mBIOU, compared to 0.430 for the naive baseline. We also use our model to predict over 8,000 previously unknown homoglyphs, and find good early indications that many of these may be true positives. Source code and list of predicted homoglyphs are uploaded to GitHub: https://github.com/PerryXDeng/weaponizing_unicode
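
The abstract names two standard building blocks without defining them: the Normalized Compression Distance (NCD) used by the baseline, and the classic Intersection-Over-Union (IOU) that mBIOU builds on. The sketch below shows only the textbook forms of both. It is not the authors' implementation, and it does not reproduce mBIOU, whose definition is not given here. The function names, the choice of zlib as the compressor, and the idea of passing glyph data as raw bytes are all assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Textbook Normalized Compression Distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest similar inputs,
    values near 1 suggest unrelated inputs.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def iou(a: set, b: set) -> float:
    """Classic Intersection-Over-Union of two sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical clusters: Latin "a" (U+0061), Cyrillic "a" (U+0430), Latin alpha (U+0251).
    predicted = {"\u0061", "\u0430"}
    reference = {"\u0061", "\u0430", "\u0251"}
    print(iou(predicted, reference))
```

In a homoglyph setting the inputs to `ncd` would plausibly be rasterized glyph bitmaps rather than text, and `iou` would compare a predicted cluster of characters against a reference cluster; both uses are assumptions about how these primitives might be applied.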

Related papers

Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
arXiv Detail & Related papers (2025-06-23T18:02:26Z)
Web Artifact Attacks Disrupt Vision Language Models [61.59021920232986]
Vision-language models (VLMs) are trained on large-scale, lightly curated web datasets. They learn unintended correlations between semantic concepts and unrelated visual signals. Prior work has weaponized these correlations as an attack vector to manipulate model predictions. We introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements.
arXiv Detail & Related papers (2025-03-17T18:59:29Z)
Extract Free Dense Misalignment from CLIP [7.0247398611254175]
This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP. We revamp the gradient-based attribution computation method, enabling negative gradient of individual text tokens to indicate misalignment. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models.
arXiv Detail & Related papers (2024-12-24T12:51:05Z)
Provably Secure Disambiguating Neural Linguistic Steganography [66.30965740387047]
The segmentation ambiguity problem, which arises when using language models based on subwords, leads to occasional decoding failures. We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem. SyncPool does not change the size of the candidate pool or the distribution of tokens and thus is applicable to provably secure language steganography methods.
arXiv Detail & Related papers (2024-03-26T09:25:57Z)
Generation is better than Modification: Combating High Class Homophily Variance in Graph Anomaly Detection [51.11833609431406]
Homophily distribution differences between different classes are significantly greater than those in homophilic and heterophilic graphs. We introduce a new metric called Class Homophily Variance, which quantitatively describes this phenomenon. To mitigate its impact, we propose a novel GNN model named Homophily Edge Generation Graph Neural Network (HedGe).
arXiv Detail & Related papers (2024-03-15T14:26:53Z)
Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images [76.47980643420375]
This paper builds on the hypothesis that there is an inherent data-hungry matter in learning semantic correspondences. We demonstrate a simple machine annotator reliably enriches paired key points via machine supervision. Our models surpass current state-of-the-art models on semantic correspondence learning benchmarks like SPair-71k, PF-PASCAL, and PF-WILLOW.
arXiv Detail & Related papers (2023-11-30T13:22:15Z)
Pairwise Similarity Learning is SimPLE [104.14303849615496]
We focus on a general yet important learning problem, pairwise similarity learning (PSL). PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification. We propose a surprisingly simple proxy-free method, called SimPLE, which requires neither feature/proxy normalization nor angular margin.
arXiv Detail & Related papers (2023-10-13T23:56:47Z)
GlyphNet: Homoglyph domains dataset and detection using attention-based Convolutional Neural Networks [1.0312968200748118]
Homoglyph attacks create illegitimate domains that are hard to distinguish from legit ones. Existing approaches use simple, string-based comparison techniques applied in primary language-based tasks. We show that our model can reach state-of-the-art accuracy in detecting homoglyph attacks with a 0.93 AUC on our dataset.
arXiv Detail & Related papers (2023-06-17T17:16:53Z)
Improving Deep Representation Learning via Auxiliary Learnable Target Coding [69.79343510578877]
This paper introduces a novel learnable target coding as an auxiliary regularization of deep representation learning. Specifically, a margin-based triplet loss and a correlation consistency loss on the proposed target codes are designed to encourage more discriminative representations.
arXiv Detail & Related papers (2023-05-30T01:38:54Z)
Leveraging Dependency Grammar for Fine-Grained Offensive Language Detection using Graph Convolutional Networks [0.5457150493905063]
We address the problem of offensive language detection on Twitter. We propose a novel approach called SyLSTM, which integrates syntactic features in the form of the dependency parse tree of a sentence. Results show that the proposed approach significantly outperforms the state-of-the-art BERT model with orders of magnitude fewer parameters.
arXiv Detail & Related papers (2022-05-26T05:27:50Z)
New Benchmarks for Learning on Non-Homophilous Graphs [20.082182515715182]
We present a series of improved graph datasets with node label relationships that do not satisfy the homophily principle. We also introduce a new measure of the presence or absence of homophily that is better suited than existing measures in different regimes.
arXiv Detail & Related papers (2021-04-03T13:45:06Z)
PhishGAN: Data Augmentation and Identification of Homoglpyh Attacks [0.0]
Homoglyph attacks are a common technique used by hackers to conduct phishing. Domain names or links that are visually similar to actual ones are created via punycode to obfuscate the attack. Here, we show how a conditional Generative Adversarial Network (GAN), PhishGAN, can be used to generate images of hieroglyphs.
arXiv Detail & Related papers (2020-06-24T13:59:09Z)
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence [93.91751021370638]
Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance. In this paper, we demonstrate the power of a simple combination of two common SSL methods: consistency regularization and pseudo-labeling. Our algorithm, FixMatch, first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images.
arXiv Detail & Related papers (2020-01-21T18:32:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.