DarkBERT: A Language Model for the Dark Side of the Internet
- URL: http://arxiv.org/abs/2305.08596v2
- Date: Thu, 18 May 2023 05:02:29 GMT
- Title: DarkBERT: A Language Model for the Dark Side of the Internet
- Authors: Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee,
Seungwon Shin
- Abstract summary: We introduce DarkBERT, a language model pretrained on Dark Web data.
We describe the steps taken to filter and compile the text data used to train DarkBERT to combat the extreme lexical and structural diversity of the Dark Web.
Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web.
- Score: 26.28825428391132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has suggested that there are clear differences in the
language used in the Dark Web compared to that of the Surface Web. As studies
on the Dark Web commonly require textual analysis of the domain, language
models specific to the Dark Web may provide valuable insights to researchers.
In this work, we introduce DarkBERT, a language model pretrained on Dark Web
data. We describe the steps taken to filter and compile the text data used to
train DarkBERT to combat the extreme lexical and structural diversity of the
Dark Web that may be detrimental to building a proper representation of the
domain. We evaluate DarkBERT and its vanilla counterpart along with other
widely used language models to validate the benefits that a Dark Web domain
specific model offers in various use cases. Our evaluations show that DarkBERT
outperforms current language models and may serve as a valuable resource for
future research on the Dark Web.
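Since DarkBERT follows the RoBERTa architecture, a domain-adapted checkpoint can be exercised with the standard `transformers` fill-mask pipeline. The sketch below is illustrative only: the released DarkBERT checkpoint is access-restricted, so a public RoBERTa model stands in for it here.

```python
# Minimal sketch: querying a RoBERTa-style masked language model the way one
# would query DarkBERT. "roberta-base" is a public stand-in; swap in the
# DarkBERT checkpoint only if you have been granted access to it.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# Masked-token prediction is the pretraining objective; a Dark Web model
# should rank domain-specific vocabulary higher than a Surface Web model.
for prediction in fill_mask("The marketplace only accepts <mask> as payment."):
    print(f"{prediction['token_str']!r}\t{prediction['score']:.3f}")
```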
Related papers
- Detecting Deceptive Dark Patterns in E-commerce Platforms [0.0]
Dark patterns are deceptive user interfaces employed by e-commerce websites to manipulate users' behavior in ways that benefit the website, often unethically.
Existing solutions include UIGuard, which uses computer vision and natural language processing, and approaches that categorize dark patterns based on detectability or utilize machine learning models trained on datasets.
We propose combining web scraping techniques with fine-tuned BERT language models and generative capabilities to identify dark patterns, including outliers.
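As a rough illustration of the fine-tuned-BERT component of such a detector (the texts, labels, and hyperparameters below are invented for the example, not taken from the paper):

```python
# Hedged sketch: one training step of a binary dark-pattern text classifier
# built on BERT. The two example strings and their labels are hypothetical.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

texts = [
    "Hurry! Only 2 left -- 14 other people are viewing this item!",  # dark pattern
    "Free shipping on orders over $50.",                             # benign
]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
optimizer.zero_grad()
loss = model(**batch, labels=labels).loss  # cross-entropy over the two classes
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```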
arXiv Detail & Related papers (2024-05-27T16:32:40Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Why is the User Interface a Dark Pattern?: Explainable Auto-Detection and its Analysis [1.4474137122906163]
Dark patterns are deceptive user interface designs for online services that make users behave in unintended ways.
We study interpretable dark pattern auto-detection, that is, why a particular user interface is detected as having dark patterns.
Our findings may prevent users from being manipulated by dark patterns, and aid in the construction of more equitable internet services.
arXiv Detail & Related papers (2023-12-30T03:53:58Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
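One plausible shape for such paired instruction-tuning data, sketched with invented field names and records (the paper's actual schema may differ):

```python
# Hypothetical record pairing a privacy-preserving (positive) target with a
# leaking (negative) one for instruction tuning with both example types.
# All field names and example strings are invented for illustration.
instruction_data = [
    {
        "instruction": "Summarize the ticket without revealing personal data.",
        "input": "Jane Doe (jane@example.com) says order #4521 arrived damaged.",
        "positive_output": "A customer reports that their order arrived damaged.",
        "negative_output": "Jane Doe (jane@example.com) reports that order #4521 arrived damaged.",
    },
]

def to_training_pairs(records):
    """Yield (prompt, target, weight); negatives carry a penalizing weight,
    e.g. for an unlikelihood-style loss term."""
    for r in records:
        prompt = f"{r['instruction']}\n{r['input']}"
        yield prompt, r["positive_output"], 1.0   # imitate
        yield prompt, r["negative_output"], -1.0  # discourage
```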
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Linguistic Dead-Ends and Alphabet Soup: Finding Dark Patterns in Japanese Apps [10.036312061637764]
We analyzed 200 popular mobile apps in the Japanese market.
We found that most apps had dark patterns, with an average of 3.9 per app.
We identified a new class of dark pattern: "Linguistic Dead-Ends", in the forms of "Untranslation" and "Alphabet Soup".
arXiv Detail & Related papers (2023-04-22T08:22:32Z)
- VeriDark: A Large-Scale Benchmark for Authorship Verification on the Dark Web [25.00969884543201]
We release VeriDark: a benchmark comprising three large-scale authorship verification datasets and one authorship identification dataset.
We evaluate competitive NLP baselines on the three datasets and perform an analysis of the predictions to better understand the limitations of such approaches.
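A typical baseline for this task is a cross-encoder that reads a document pair and predicts same-author vs. different-author; a hedged sketch follows (the base model and example texts are illustrative, not VeriDark's exact baselines):

```python
# Sketch of a cross-encoder authorship-verification baseline: encode the two
# documents as one sequence pair and classify same-author vs. different-author.
# Outputs are only meaningful after fine-tuning on a verification dataset.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model.eval()

doc_a = "vendor reships free if tracking shows a seizure, stealth is solid"
doc_b = "reship policy only kicks in once tracking confirms the seizure"

inputs = tokenizer(doc_a, doc_b, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(f"P(same author) = {probs[0, 1]:.3f}")
```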
arXiv Detail & Related papers (2022-07-07T17:57:11Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Shedding New Light on the Language of the Dark Web [28.203247249201535]
This paper introduces CoDA, a publicly available Dark Web dataset consisting of 10000 web documents tailored towards text-based analysis.
We conduct a thorough linguistic analysis of the Dark Web and examine the textual differences between the Dark Web and the Surface Web.
We also assess the performance of various methods of Dark Web page classification.
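The simplest baselines such comparisons include are bag-of-words pipelines; a sketch with invented page samples and a hypothetical label taxonomy:

```python
# Sketch of a classical Dark Web page-classification baseline: TF-IDF features
# with a linear classifier. The pages and category labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = [
    "fresh fullz and cvv dumps, escrow accepted",
    "weekly mirror list of popular onion forums",
    "anonymous whistleblower drop box, PGP key below",
]
labels = ["financial-fraud", "directory", "activism"]  # hypothetical taxonomy

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(pages, labels)
print(clf.predict(["updated mirror links for the usual forums"]))
```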
arXiv Detail & Related papers (2022-04-14T11:17:22Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
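Token-free here means the model consumes raw bytes rather than entries from a fixed subword vocabulary, which is what lets one model serve many languages. A sketch of the input side only (Charformer's learned downsampling of the byte sequence is omitted):

```python
# Sketch of byte-level, vocabulary-free input preparation as used by token-free
# models. Mapping UTF-8 bytes to ids covers every script with 256 symbols; the
# offset reserving low ids for special tokens follows a common convention.
def bytes_to_ids(text: str, offset: int = 3) -> list[int]:
    """Map each UTF-8 byte to an id, leaving ids 0-2 for special tokens."""
    return [b + offset for b in text.encode("utf-8")]

print(bytes_to_ids("toxic?"))  # ASCII: one id per character
print(bytes_to_ids("有毒?"))   # CJK: three ids (bytes) per character, no OOV
```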
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- LaMDA: Language Models for Dialog Applications [75.75051929981933]
LaMDA is a family of Transformer-based neural language models specialized for dialog.
Fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements.
arXiv Detail & Related papers (2022-01-20T15:44:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences arising from its use.