Shedding New Light on the Language of the Dark Web
- URL: http://arxiv.org/abs/2204.06885v1
- Date: Thu, 14 Apr 2022 11:17:22 GMT
- Title: Shedding New Light on the Language of the Dark Web
- Authors: Youngjin Jin, Eugene Jang, Yongjae Lee, Seungwon Shin, Jin-Woo Chung
- Abstract summary: This paper introduces CoDA, a publicly available Dark Web dataset consisting of 10000 web documents tailored towards text-based analysis.
We conduct a thorough linguistic analysis of the Dark Web and examine the textual differences between the Dark Web and the Surface Web.
We also assess the performance of various methods of Dark Web page classification.
- Score: 28.203247249201535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The hidden nature and the limited accessibility of the Dark Web, combined
with the lack of public datasets in this domain, make it difficult to study its
inherent characteristics such as linguistic properties. Previous works on text
classification of Dark Web domain have suggested that the use of deep neural
models may be ineffective, potentially due to the linguistic differences
between the Dark and Surface Webs. However, not much work has been done to
uncover the linguistic characteristics of the Dark Web. This paper introduces
CoDA, a publicly available Dark Web dataset consisting of 10000 web documents
tailored towards text-based Dark Web analysis. By leveraging CoDA, we conduct a
thorough linguistic analysis of the Dark Web and examine the textual
differences between the Dark Web and the Surface Web. We also assess the
performance of various methods of Dark Web page classification. Finally, we
compare CoDA with an existing public Dark Web dataset and evaluate their
suitability for various use cases.
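The page-classification task evaluated on CoDA can be illustrated with a minimal, standard-library-only sketch: a bag-of-words nearest-centroid classifier over page text. The category names and documents below are invented toy examples, not drawn from CoDA, and the paper's actual methods (including deep neural models) are far more sophisticated.

```python
# Minimal sketch of text-based web page classification:
# bag-of-words vectors, per-class centroids, cosine-similarity assignment.
# Categories and documents are hypothetical toy examples.
from collections import Counter
import math

def vectorize(text):
    """Lowercased unigram counts as a sparse bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(labeled_docs):
    """Sum the bag-of-words vectors of each class into a centroid."""
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids

def classify(text, centroids):
    """Assign the class whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

train = [
    ("marketplace", "buy sell vendor escrow listing price bitcoin"),
    ("forum", "thread reply post member discussion topic board"),
]
centroids = train_centroids(train)
print(classify("vendor listing price in bitcoin", centroids))  # marketplace
```

A real evaluation would use TF-IDF weighting or pretrained language models and a labeled corpus such as CoDA; this sketch only shows the shape of the task.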
Related papers
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
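The extraction step a generated scraper performs can be sketched with the standard library alone: pulling text from elements matched by a simple class rule. The HTML and class name below are made-up examples; AutoScraper's LLM-generated scrapers handle far more varied markup than this.

```python
# Minimal sketch of a scraper's extraction rule: collect text inside
# elements whose class attribute matches a target. Hypothetical markup.
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collect text inside elements whose class matches a target rule."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0          # >0 while inside a matched element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1     # nested tag inside a matched element
        elif ("class", self.target_class) in attrs:
            self.depth = 1      # entering a matched element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.results.append(data.strip())

html = '<ul><li class="title">First post</li><li class="title">Second post</li></ul>'
scraper = TitleScraper("title")
scraper.feed(html)
print(scraper.results)  # ['First post', 'Second post']
```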
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- Integrating Dark Pattern Taxonomies [0.0]
Malicious and exploitative design has expanded to multiple domains in the past 10 years.
By leaning on network analysis tools and methods, this paper synthesizes existing taxonomy elements as a directed graph.
In doing so, the interconnectedness of dark patterns can be more clearly revealed via community detection.
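The graph-based synthesis described above can be sketched in miniature: represent taxonomy entries as a directed graph and group them by weak connectivity. The edges and pattern names below are hypothetical, and the paper's actual community detection is richer than this illustration.

```python
# Minimal sketch: taxonomy entries as a directed graph, grouped into
# communities via weakly connected components (edge direction ignored).
# Edge list and names are hypothetical examples.
from collections import defaultdict, deque

def weakly_connected_components(edges):
    """Group nodes of a directed graph by ignoring edge direction."""
    adj = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        adj[src].add(dst)
        adj[dst].add(src)  # undirected view for weak connectivity
        nodes.update((src, dst))
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:           # BFS over the undirected view
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

edges = [
    ("nagging", "confirmshaming"),              # hypothetical links
    ("confirmshaming", "toying-with-emotion"),
    ("sneak-into-basket", "hidden-costs"),
]
for comp in weakly_connected_components(edges):
    print(sorted(comp))
```

Proper community detection (e.g. modularity-based methods) can split a single connected component into finer groups; weak connectivity is only the coarsest version of the idea.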
arXiv Detail & Related papers (2024-02-26T17:26:31Z)
- DarkBERT: A Language Model for the Dark Side of the Internet [26.28825428391132]
We introduce DarkBERT, a language model pretrained on Dark Web data.
We describe the steps taken to filter and compile the text data used to train DarkBERT to combat the extreme lexical and structural diversity of the Dark Web.
Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web.
arXiv Detail & Related papers (2023-05-15T12:23:10Z)
- Linguistic Dead-Ends and Alphabet Soup: Finding Dark Patterns in Japanese Apps [10.036312061637764]
We analyzed 200 popular mobile apps in the Japanese market.
We found that most apps had dark patterns, with an average of 3.9 per app.
We identified a new class of dark pattern: "Linguistic Dead-Ends", in the forms of "Untranslation" and "Alphabet Soup".
arXiv Detail & Related papers (2023-04-22T08:22:32Z)
- ReDDIT: Regret Detection and Domain Identification from Text [62.997667081978825]
We present a novel dataset of Reddit texts that have been classified into three classes: Regret by Action, Regret by Inaction, and No Regret.
Our findings show that Reddit users are most likely to express regret for past actions, particularly in the domain of relationships.
arXiv Detail & Related papers (2022-12-14T23:41:57Z)
- VeriDark: A Large-Scale Benchmark for Authorship Verification on the Dark Web [25.00969884543201]
We release VeriDark: a benchmark comprising three large-scale authorship verification datasets and one authorship identification dataset.
We evaluate competitive NLP baselines on the three datasets and perform an analysis of the predictions to better understand the limitations of such approaches.
arXiv Detail & Related papers (2022-07-07T17:57:11Z)
- TeKo: Text-Rich Graph Neural Networks with External Knowledge [75.91477450060808]
We propose a novel text-rich graph neural network with external knowledge (TeKo).
We first present a flexible heterogeneous semantic network that incorporates high-quality entities.
We then introduce two types of external knowledge: structured triplets and unstructured entity descriptions.
arXiv Detail & Related papers (2022-06-15T02:33:10Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Counteracting Dark Web Text-Based CAPTCHA with Generative Adversarial Learning for Proactive Cyber Threat Intelligence [15.71648511138197]
Text-based CAPTCHA serves as the most prevalent and prohibitive type of anti-crawling measure on the dark web.
Existing automated CAPTCHA breaking methods have difficulties in overcoming dark web challenges.
We propose a novel framework for automated breaking of dark web CAPTCHA to facilitate dark web data collection.
arXiv Detail & Related papers (2022-01-08T09:53:31Z) - Lighting the Darkness in the Deep Learning Era [118.35081853500411]
Low-light image enhancement (LLIE) aims at improving the perception or interpretability of an image captured in an environment with poor illumination.
Recent advances in this area are dominated by deep learning-based solutions.
We provide a comprehensive survey to cover various aspects ranging from algorithm taxonomy to unsolved open issues.
arXiv Detail & Related papers (2021-04-21T19:12:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.