Towards Dark Jargon Interpretation in Underground Forums
- URL: http://arxiv.org/abs/2011.03011v2
- Date: Mon, 11 Jan 2021 00:32:12 GMT
- Title: Towards Dark Jargon Interpretation in Underground Forums
- Authors: Dominic Seyler and Wei Liu and XiaoFeng Wang and ChengXiang Zhai
- Abstract summary: We present a novel method towards automatically identifying and interpreting dark jargons.
We formalize the problem as a mapping from dark words to "clean" words with no hidden meaning.
Our method makes use of interpretable representations of dark and clean words in the form of probability distributions over a shared vocabulary.
- Score: 37.15748678894555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dark jargons are benign-looking words that have hidden, sinister meanings and
are used by participants of underground forums for illicit behavior. For
example, the dark term "rat" is often used in lieu of "Remote Access Trojan".
In this work we present a novel method towards automatically identifying and
interpreting dark jargons. We formalize the problem as a mapping from dark
words to "clean" words with no hidden meaning. Our method makes use of
interpretable representations of dark and clean words in the form of
probability distributions over a shared vocabulary. In our experiments we show
our method to be effective in terms of dark jargon identification, as it
outperforms another related method on simulated data. Using manual evaluation,
we show that our method is able to detect dark jargons in a real-world
underground forum dataset.
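The abstract's central idea, representing dark and clean words as probability distributions over a shared vocabulary and mapping one onto the other, can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy corpora, the context-window counting, the additive smoothing, the use of KL divergence as the comparison measure, and the helper names `context_distribution`, `kl_divergence`, and `interpret`.

```python
import math
from collections import Counter


def context_distribution(corpus, target, vocab, window=2, alpha=0.1):
    """Smoothed distribution over `vocab` of the words seen near `target`.

    Hypothetical helper: counts neighbours within `window` tokens and
    applies additive smoothing so every vocabulary word has nonzero mass.
    """
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in vocab:
                    counts[tokens[j]] += 1
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def interpret(dark_word, dark_corpus, clean_corpus, candidates, vocab):
    """Rank clean candidate words by how closely their context distribution
    in the clean corpus matches that of `dark_word` in the dark corpus."""
    p = context_distribution(dark_corpus, dark_word, vocab)
    scored = [
        (kl_divergence(p, context_distribution(clean_corpus, c, vocab)), c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored)]


# Toy data, invented for this sketch.
dark_corpus = [
    "selling a rat to control the victim pc remotely",
    "this rat gives remote control of the victim machine",
]
clean_corpus = [
    "a trojan gives remote control of the victim machine",
    "the trojan can control the victim pc remotely",
    "the rodent ate cheese in the kitchen",
    "a rodent ran across the kitchen floor",
]
vocab = {w for s in dark_corpus + clean_corpus for w in s.lower().split()}

ranking = interpret("rat", dark_corpus, clean_corpus, ["trojan", "rodent"], vocab)
print(ranking)
```

Because both representations are distributions over the same vocabulary, each candidate mapping can be inspected word by word, which is the kind of interpretability the abstract refers to. A real system would need far larger corpora and a more careful choice of context and smoothing.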
Related papers
- How Contentious Terms About People and Cultures are Used in Linked Open Data [0.0]
When outdated and culturally stereotyping terminology is used in literals, they may appear as offensive to users in interfaces and propagate stereotypes to algorithms trained on them.
We study how frequently and in which literals contentious terms about people and cultures occur in linked open data (LOD).
We inspect occurrences of these terms in four widely used datasets: Wikidata, The Getty Art & Architecture Thesaurus, Princeton WordNet, and Open Dutch WordNet.
arXiv Detail & Related papers (2023-11-13T18:25:20Z)
- Using meaning instead of words to track topics [0.76146285961466]
Currently, all existing topic tracking methods use lexical information by matching word usage.
We explore a novel semantic-based method using word embeddings.
Our results show that a semantic-based approach to topic tracking is on par with the lexical approach but makes different mistakes.
arXiv Detail & Related papers (2023-01-02T08:55:55Z)
- What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs [82.93345261434943]
Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects.
This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism.
Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains.
arXiv Detail & Related papers (2022-06-19T09:07:30Z)
- Discovering the Hidden Vocabulary of DALLE-2 [96.19666636109729]
We find that DALLE-2 seems to have a hidden vocabulary that can be used to generate images with absurd prompts.
For example, it seems that "Apoploe vesrreaitais" means birds and "Contarra ccetnxniams luryca tanniounons" (sometimes) means bugs or pests.
arXiv Detail & Related papers (2022-06-01T01:14:48Z)
- Euphemistic Phrase Detection by Masked Language Model [9.49544185939481]
We perform phrase mining on a social media corpus to extract quality phrases.
Then, we utilize word embedding similarities to select a set of euphemistic phrase candidates.
We report 20-50% higher detection accuracies using our algorithm for detecting euphemistic phrases.
arXiv Detail & Related papers (2021-09-10T04:57:30Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
- Self-Supervised Euphemism Detection and Identification for Content Moderation [16.322965299627974]
One common use of euphemisms is to evade content moderation policies enforced by social media platforms.
It is usually apparent to a human moderator that a word is being used euphemistically, but they may not know what the secret meaning is.
This paper will demonstrate unsupervised algorithms that can both detect words being used euphemistically, and identify the secret meaning of each word.
arXiv Detail & Related papers (2021-03-31T04:52:38Z)
- Sent2Matrix: Folding Character Sequences in Serpentine Manifolds for Two-Dimensional Sentence [54.6266741821988]
We propose to convert texts into 2-D representations and develop the Sent2Matrix method.
Our method allows for the explicit incorporation of both word morphologies and boundaries.
Notably, our method is the first attempt to represent texts in 2-D formats.
arXiv Detail & Related papers (2021-03-15T13:52:47Z)
- Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems [54.49880724137688]
The problem of out of vocabulary words (OOV) is typical for any speech recognition system.
One of the popular approaches to cover OOVs is to use subword units rather than words.
In this paper we explore different existing methods of this solution on both graph construction and search method levels.
arXiv Detail & Related papers (2020-03-19T21:24:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.