Measuring and Modeling the Free Content Web
- URL: http://arxiv.org/abs/2304.14359v1
- Date: Wed, 26 Apr 2023 04:17:43 GMT
- Title: Measuring and Modeling the Free Content Web
- Authors: Abdulrahman Alabduljabbar and Runyu Ma and Ahmed Abusnaina and Rhongho
Jang and Songqing Chen and DaeHun Nyang and David Mohaisen
- Abstract summary: We investigate the similarities and differences between free content and premium websites.
For risk analysis, we consider and examine the maliciousness of these websites at the website- and component-level.
- Score: 13.982229874909978
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Free content websites that provide free books, music, games, movies, etc.,
have existed on the Internet for many years. While it is a common belief that
such websites might be different from premium websites providing the same
content types, an analysis that supports this belief is lacking in the
literature. In particular, it is unclear if those websites are as safe as their
premium counterparts. In this paper, we set out to investigate, by analysis and
quantification, the similarities and differences between free content and
premium websites, including their risk profiles. To conduct this analysis, we
assembled a list of 834 free content websites offering books, games, movies,
music, and software, and 728 premium websites offering content of the same
type. We then contribute domain-, content-, and risk-level analysis, examining
and contrasting the websites' domain names, creation times, SSL certificates,
HTTP requests, page size, average load time, and content type. For risk
analysis, we consider and examine the maliciousness of these websites at the
website- and component-level. Among other interesting findings, we show that
free content websites tend to be widely distributed across TLDs and change
more over time, with an upward trend in newly registered domains. Moreover, the
free content websites are 4.5 times more likely to utilize an expired
certificate, 19 times more likely to be malicious at the website level, and
2.64 times more likely to be malicious at the component level. Encouraged by
the clear differences between the two types of websites, we explore the
automation and generalization of risk modeling for risky free content
websites, showing that a simple machine learning-based technique achieves
86.81% accuracy in identifying them.
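The abstract reports that a simple machine learning-based technique reaches 86.81% accuracy in identifying risky free content websites, but does not specify the model or features here. As a minimal illustrative sketch, assuming a logistic regression over a few of the measured attributes (certificate status, domain age, HTTP request count — the feature encoding and toy data below are assumptions, not the paper's dataset):

```python
import math

# Toy feature vectors per website (invented for illustration):
# [expired_certificate (0/1), domain_age_years, http_requests / 100]
# Label: 1 = flagged malicious, 0 = benign.
DATA = [
    ([1, 0.5, 1.20], 1),
    ([1, 1.0, 0.95], 1),
    ([0, 0.8, 1.10], 1),
    ([1, 2.0, 0.60], 1),
    ([0, 8.0, 0.40], 0),
    ([0, 12.0, 0.55], 0),
    ([0, 6.0, 0.30], 0),
    ([0, 10.0, 0.25], 0),
]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train(data, lr=0.5, epochs=2000):
    """Plain batch gradient descent for logistic regression."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * n, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(n):
                gw[i] += err * x[i]
            gb += err
        for i in range(n):
            w[i] -= lr * gw[i] / len(data)
        b -= lr * gb / len(data)
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

w, b = train(DATA)
accuracy = sum(predict(w, b, x) == y for x, y in DATA) / len(DATA)
print(f"training accuracy: {accuracy:.2f}")
```

In practice the paper's pipeline would train on the labeled 834 free and 728 premium websites with the full set of domain-, content-, and risk-level features; the sketch only shows the shape of such a classifier.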
Related papers
- Securing the Web: Analysis of HTTP Security Headers in Popular Global Websites [2.7039386580759666]
Over half of the websites examined (55.66%) received a dismal security grade of 'F'.
These low scores expose multiple issues such as weak implementation of Content Security Policies (CSP), neglect of HSTS guidelines, and insufficient application of Subresource Integrity (SRI).
arXiv Detail & Related papers (2024-10-19T01:03:59Z)
- The Web unpacked: a quantitative analysis of global Web usage [0.0]
We estimate the total web traffic and investigate its distribution among domains and industry sectors.
Our analysis reveals a significant concentration of web traffic, with a diminutive number of top websites capturing the majority of visits.
Much of the traffic goes to for-profit but mostly free-of-charge websites, highlighting the dominance of business models not based on paywalls.
arXiv Detail & Related papers (2024-04-26T01:05:47Z)
- Dismantling Common Internet Services for Ad-Malware Detection [0.0]
We evaluate who defines ad-malware on the Internet.
Up to 0.47% of the domains found during crawling are labeled as suspicious by DNS providers.
Only about 0.7% to 3.2% of these domains are categorized as ad-malware.
arXiv Detail & Related papers (2024-04-22T13:59:37Z)
- What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora.
WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node.
Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z)
- User Attitudes to Content Moderation in Web Search [49.1574468325115]
We examine the levels of support for different moderation practices applied to potentially misleading and/or potentially offensive content in web search.
We find that the most supported practice is informing users about potentially misleading or offensive content, and the least supported one is the complete removal of search results.
More conservative users and users with lower levels of trust in web search results are more likely to be against content moderation in web search.
arXiv Detail & Related papers (2023-10-05T10:57:15Z)
- An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software [64.367830425115]
Social media platforms are being increasingly misused to spread toxic content, including hate speech, malicious advertising, and pornography.
Despite tremendous efforts in developing and deploying content moderation methods, malicious users can evade moderation by embedding texts into images.
We propose a metamorphic testing framework for content moderation software.
arXiv Detail & Related papers (2023-08-18T20:33:06Z)
- Do Content Management Systems Impact the Security of Free Content Websites? A Correlation Analysis [9.700241283477343]
Assembling more than 1,500 websites with free and premium content, we identify their content management system (CMS) and malicious attributes.
We find that, despite the significant number of custom code websites, the use of CMS's is pervasive.
Even a small number of unpatched vulnerabilities in popular CMS's could be a potential cause for significant maliciousness.
arXiv Detail & Related papers (2022-10-21T16:19:09Z)
- Modeling Content Creator Incentives on Algorithm-Curated Platforms [76.53541575455978]
We study how algorithmic choices affect the existence and character of (Nash) equilibria in exposure games.
We propose tools for numerically finding equilibria in exposure games, and illustrate results of an audit on the MovieLens and LastFM datasets.
arXiv Detail & Related papers (2022-06-27T08:16:59Z)
- Leveraging Google's Publisher-specific IDs to Detect Website Administration [3.936965297430477]
We propose a novel, graph-based methodology to detect administration of websites on the Web.
We apply our methodology across the top 1 million websites and study the characteristics of the created graphs of website administration.
Our findings show that approximately 90% of websites are each associated with a single publisher, and that small publishers tend to manage less popular websites.
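The graph-based idea summarized above can be sketched compactly: websites that share a publisher-specific ID are linked, and connected components of that graph approximate administration clusters. A minimal sketch, assuming crawled publisher-ID observations (all domains and IDs below are invented):

```python
from collections import defaultdict

# Hypothetical crawl output: domain -> publisher IDs observed on it.
# All values are invented for illustration, not the paper's data.
OBSERVED = {
    "news-a.example": {"pub-111"},
    "news-b.example": {"pub-111", "pub-222"},
    "blog-c.example": {"pub-222"},
    "shop-d.example": {"pub-333"},
}

def administration_clusters(observed):
    """Group domains that transitively share a publisher ID."""
    by_pub = defaultdict(set)
    for domain, pubs in observed.items():
        for p in pubs:
            by_pub[p].add(domain)
    # Merge overlapping domain sets: connected components of the
    # bipartite domain-publisher graph.
    clusters = []
    for group in by_pub.values():
        merged = set(group)
        rest = []
        for c in clusters:
            if c & merged:
                merged |= c
            else:
                rest.append(c)
        clusters = rest + [merged]
    return clusters

clusters = administration_clusters(OBSERVED)
```

Here "news-b.example" bridges pub-111 and pub-222, so the first three domains fall into one cluster while "shop-d.example" stands alone.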
arXiv Detail & Related papers (2022-02-10T14:59:17Z)
- What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus [77.34726150561087]
We analyze the Common Crawl, a colossal web corpus extensively used for training language models.
We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures.
arXiv Detail & Related papers (2021-05-06T14:49:43Z)
- ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages [66.45377533562417]
We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template.
Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
arXiv Detail & Related papers (2020-05-14T16:15:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.