Automatic Generation of Web Censorship Probe Lists
- URL: http://arxiv.org/abs/2407.08185v1
- Date: Thu, 11 Jul 2024 05:04:52 GMT
- Title: Automatic Generation of Web Censorship Probe Lists
- Authors: Jenny Tang, Leo Alvarez, Arjun Brar, Nguyen Phong Hoang, Nicolas Christin,
- Abstract summary: Previous efforts to generate domain probe lists have been mostly manual or crowdsourced.
This paper explores methods for automatically generating probe lists that are both comprehensive and up-to-date for Web censorship measurement.
- Score: 6.051603326423421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Domain probe lists--used to determine which URLs to probe for Web censorship--play a critical role in Internet censorship measurement studies. Indeed, the size and accuracy of the domain probe list limits the set of censored pages that can be detected; inaccurate lists can lead to an incomplete view of the censorship landscape or biased results. Previous efforts to generate domain probe lists have been mostly manual or crowdsourced. This approach is time-consuming, prone to errors, and does not scale well to the ever-changing censorship landscape. In this paper, we explore methods for automatically generating probe lists that are both comprehensive and up-to-date for Web censorship measurement. We start from an initial set of 139,957 unique URLs from various existing test lists consisting of pages from a variety of languages to generate new candidate pages. By analyzing content from these URLs (i.e., performing topic and keyword extraction), expanding these topics, and using them as a feed to search engines, our method produces 119,255 new URLs across 35,147 domains. We then test the new candidate pages by attempting to access each URL from servers in eleven different global locations over a span of four months to check for their connectivity and potential signs of censorship. Our measurements reveal that our method discovered over 1,400 domains--not present in the original dataset--we suspect to be blocked. In short, automatically updating probe lists is possible, and can help further automate censorship measurements at scale.
Related papers
- Understanding Routing-Induced Censorship Changes Globally [5.79183660559872]
We investigate the extent to which Equal-cost Multi-path (ECMP) routing is the cause for inconsistencies in censorship results.
We find ECMP routing significantly changes observed censorship across protocols, censor mechanisms, and in 17 countries.
Our work points to methods for improving future studies, reducing inconsistencies and increasing repeatability.
arXiv Detail & Related papers (2024-06-27T16:21:31Z) - Assaying on the Robustness of Zero-Shot Machine-Generated Text Detectors [57.7003399760813]
We explore advanced Large Language Models (LLMs) and their specialized variants, contributing to this field in several ways.
We uncover a significant correlation between topics and detection performance.
These investigations shed light on the adaptability and robustness of these detection methods across diverse topics.
arXiv Detail & Related papers (2023-12-20T10:53:53Z) - Cookiescanner: An Automated Tool for Detecting and Evaluating GDPR
Consent Notices on Websites [1.3416250383686867]
We present emphcookiescanner, an automated scanning tool that detects and extracts consent notices.
We found that manually filter lists have the highest precision but recall fewer consent notices than our keyword-based methods.
Our BERT model achieves high precision for English notices, which is in line with previous work, but suffers from low recall due to insufficient candidate extraction.
arXiv Detail & Related papers (2023-09-12T13:04:00Z) - How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to millions of passages with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z) - WebCPM: Interactive Web Search for Chinese Long-form Question Answering [104.676752359777]
Long-form question answering (LFQA) aims at answering complex, open-ended questions with detailed, paragraph-length responses.
We introduce WebCPM, the first Chinese LFQA dataset.
We collect 5,500 high-quality question-answer pairs, together with 14,315 supporting facts and 121,330 web search actions.
arXiv Detail & Related papers (2023-05-11T14:47:29Z) - Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc.
Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques.
In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z) - Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases insignificant changes in input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - Augmenting Rule-based DNS Censorship Detection at Scale with Machine
Learning [38.00013408742201]
Censorship of the domain name system (DNS) is a key mechanism used across different countries.
In this paper, we explore how machine learning (ML) models can help streamline the detection process.
We find that unsupervised models, trained solely on uncensored instances, can identify new instances and variations of censorship missed by existing probes.
arXiv Detail & Related papers (2023-02-03T23:36:30Z) - An Adversarial Attack Analysis on Malicious Advertisement URL Detection
Framework [22.259444589459513]
Malicious advertisement URLs pose a security risk since they are the source of cyber-attacks.
Existing malicious URL detection techniques are limited and to handle unseen features as well as generalize to test data.
In this study, we extract a novel set of lexical and web-scrapped features and employ machine learning technique to set up system for fraudulent advertisement URLs detection.
arXiv Detail & Related papers (2022-04-27T20:06:22Z) - Fingerprinting Search Keywords over HTTPS at Scale [0.5549359079450177]
fingerprinting the search keywords issued by a user on popular web search engines is a significant threat to user privacy.
We study the impact of several factors, including client platform diversity, choice of search engine, feature sets and classification frameworks.
Our analysis reveals several insights into the threat of keyword fingerprinting in modern HTTPS traffic.
arXiv Detail & Related papers (2020-08-18T21:24:52Z) - On the Social and Technical Challenges of Web Search Autosuggestion
Moderation [118.47867428272878]
Autosuggestions are typically generated by machine learning (ML) systems trained on a corpus of search logs and document representations.
While current search engines have become increasingly proficient at suppressing such problematic suggestions, there are still persistent issues that remain.
We discuss several dimensions of problematic suggestions, difficult issues along the pipeline, and why our discussion applies to the increasing number of applications beyond web search.
arXiv Detail & Related papers (2020-07-09T19:22:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.