Euphemistic Phrase Detection by Masked Language Model
- URL: http://arxiv.org/abs/2109.04666v1
- Date: Fri, 10 Sep 2021 04:57:30 GMT
- Title: Euphemistic Phrase Detection by Masked Language Model
- Authors: Wanzheng Zhu, Suma Bhat
- Abstract summary: We perform phrase mining on a social media corpus to extract quality phrases.
Then, we utilize word embedding similarities to select a set of euphemistic phrase candidates.
We report 20-50% higher detection accuracies using our algorithm for detecting euphemistic phrases.
- Score: 9.49544185939481
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It is a well-known approach for fringe groups and organizations to use
euphemisms -- ordinary-sounding and innocent-looking words with a secret
meaning -- to conceal what they are discussing. For instance, drug dealers
often use "pot" for marijuana and "avocado" for heroin. From a social media
content moderation perspective, though recent advances in NLP have enabled the
automatic detection of such single-word euphemisms, no existing work is capable
of automatically detecting multi-word euphemisms, such as "blue dream"
(marijuana) and "black tar" (heroin). Our paper tackles the problem of
euphemistic phrase detection without human effort for the first time, as far as
we are aware. We first perform phrase mining on a raw text corpus (e.g., social
media posts) to extract quality phrases. Then, we utilize word embedding
similarities to select a set of euphemistic phrase candidates. Finally, we rank
those candidates by a masked language model -- SpanBERT. Compared to strong
baselines, we report 20-50% higher detection accuracies using our algorithm for
detecting euphemistic phrases.
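The last two stages described in the abstract (selecting candidates by embedding similarity, then ranking them with a masked language model) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation. The three-dimensional vectors, the `select_candidates` and `rank_candidates` helpers, the use of the maximum similarity over target keywords, the averaging over masked sentences, and the `toy_fill_probability` lookup that stands in for SpanBERT are all assumptions made for the example. Phrase mining is assumed to have already produced the phrase list.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_candidates(phrase_vectors, target_vectors, top_k):
    """Keep the top_k phrases closest to any target keyword (assumed criterion)."""
    scored = {
        phrase: max(cosine(vec, t) for t in target_vectors.values())
        for phrase, vec in phrase_vectors.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

def rank_candidates(candidates, masked_sentences, fill_probability):
    """Rank candidates by their mean probability of filling the masked slot."""
    def score(phrase):
        total = sum(fill_probability(s, phrase) for s in masked_sentences)
        return total / len(masked_sentences)
    return sorted(candidates, key=score, reverse=True)

# Hypothetical embeddings for mined phrases and for known target keywords.
phrase_vectors = {
    "blue dream": [0.9, 0.1, 0.0],
    "black tar": [0.8, 0.3, 0.1],
    "good morning": [0.0, 0.2, 0.9],
}
target_vectors = {
    "marijuana": [1.0, 0.0, 0.0],
    "heroin": [0.7, 0.4, 0.0],
}

# Stand-in for a masked language model: made-up fill probabilities.
TOY_PROBS = {"blue dream": 0.2, "black tar": 0.5}

def toy_fill_probability(sentence, phrase):
    # A real scorer would condition on the sentence; this lookup ignores it.
    return TOY_PROBS.get(phrase, 0.0)

masked_sentences = ["selling [MASK] for cheap", "anyone got [MASK] tonight"]

candidates = select_candidates(phrase_vectors, target_vectors, top_k=2)
ranked = rank_candidates(candidates, masked_sentences, toy_fill_probability)
print(candidates)
print(ranked)
```

In a real setting, `toy_fill_probability` would be replaced by a call into a span-level masked language model, and the embeddings would be trained on the corpus rather than written by hand.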
Related papers
- Impromptu Cybercrime Euphemism Detection [20.969469059941545]
We introduce the Impromptu Cybercrime Euphemisms Detection dataset.
We propose a detection framework tailored to this problem.
Our approach achieves a remarkable 76-fold improvement compared to the previous state-of-the-art euphemism detector.
arXiv Detail & Related papers (2024-12-02T11:56:06Z)
- Towards Effective Paraphrasing for Information Disguise [13.356934367660811]
Research on Information Disguise (ID) becomes important when authors' written online communication pertains to sensitive domains.
We propose a framework where, for a given sentence from an author's post, we perform iterative perturbation on the sentence in the direction of paraphrasing.
Our work introduces a novel method of phrase-importance rankings using perplexity scores and involves multi-level phrase substitutions via beam search.
arXiv Detail & Related papers (2023-11-08T21:12:59Z)
- Biomedical Named Entity Recognition via Dictionary-based Synonym Generalization [51.89486520806639]
We propose a novel Synonym Generalization (SynGen) framework that recognizes the biomedical concepts contained in the input text using span-based predictions.
We extensively evaluate our approach on a wide range of benchmarks and the results verify that SynGen outperforms previous dictionary-based models by notable margins.
arXiv Detail & Related papers (2023-05-22T14:36:32Z)
- Keywords and Instances: A Hierarchical Contrastive Learning Framework Unifying Hybrid Granularities for Text Generation [59.01297461453444]
We propose a hierarchical contrastive learning mechanism that unifies semantic meaning across hybrid granularities in the input text.
Experiments demonstrate that our model outperforms competitive baselines on paraphrasing, dialogue generation, and storytelling tasks.
arXiv Detail & Related papers (2022-05-26T13:26:03Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success rates and semantics rates by changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
- Self-Supervised Euphemism Detection and Identification for Content Moderation [16.322965299627974]
One common use of euphemisms is to evade content moderation policies enforced by social media platforms.
It is usually apparent to a human moderator that a word is being used euphemistically, but they may not know what the secret meaning is.
This paper will demonstrate unsupervised algorithms that can both detect words being used euphemistically, and identify the secret meaning of each word.
arXiv Detail & Related papers (2021-03-31T04:52:38Z)
- Towards Dark Jargon Interpretation in Underground Forums [37.15748678894555]
We present a novel method towards automatically identifying and interpreting dark jargons.
We formalize the problem as a mapping from dark words to "clean" words with no hidden meaning.
Our method makes use of interpretable representations of dark and clean words in the form of probability distributions over a shared vocabulary.
arXiv Detail & Related papers (2020-11-05T18:08:32Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
- Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems [54.49880724137688]
The problem of out-of-vocabulary (OOV) words is typical for any speech recognition system.
One popular approach to covering OOVs is to use subword units rather than words.
In this paper we explore different existing methods for this solution at both the graph construction and search method levels.
arXiv Detail & Related papers (2020-03-19T21:24:45Z)
- Humpty Dumpty: Controlling Word Meanings via Corpus Poisoning [29.181547214915238]
We show that an attacker can control the "meaning" of new and existing words by changing their locations in the embedding space.
An attack on the embedding can affect diverse downstream tasks, demonstrating for the first time the power of data poisoning in transfer learning scenarios.
arXiv Detail & Related papers (2020-01-14T17:48:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.