Related papers: "Way back then": A Data-driven View of 25+ years of Web Evolution

"Way back then": A Data-driven View of 25+ years of Web Evolution

URL: http://arxiv.org/abs/2202.08239v1
Date: Wed, 16 Feb 2022 18:36:03 GMT
Title: "Way back then": A Data-driven View of 25+ years of Web Evolution
Authors: Vibhor Agarwal, Nishanth Sastry
Abstract summary: We look at the top 100 Alexa websites for over 25 years from the Internet Archive or the "Wayback Machine", archive.org. We study the changes in popularity, from Geocities and Yahoo! in the mid-to-late 1990s to the likes of Google, Facebook, and TikTok of today. We also look at different categories of websites and their popularity over the years and find evidence for the decline in popularity of news and education-related websites.
Score: 4.055696230852368
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Since the inception of the first web page three decades ago, the Web has evolved considerably, from static HTML pages in the beginning to the dynamic web pages of today, and from the mainly text-based pages of the 1990s to today's multimedia-rich pages. Although much of this is known anecdotally, to our knowledge, there is no quantitative documentation of the extent and timing of these changes. This paper attempts to address this gap in the literature by looking at the top 100 Alexa websites for over 25 years from the Internet Archive or the "Wayback Machine", archive.org. We study the changes in popularity, from Geocities and Yahoo! in the mid-to-late 1990s to the likes of Google, Facebook, and TikTok of today. We also look at different categories of websites and their popularity over the years and find evidence for the decline in popularity of news and education-related websites, which have been replaced by streaming media and social networking sites. We explore the emergence and relative prevalence of different MIME types (text vs. image vs. video vs. JavaScript and JSON) and study whether the use of text on the Internet is declining.

Related papers

The Case for HTML First Web Development [0.0]
HTML First development puts focus on literally using HTML first when possible. It seems HTML-oriented web development can provide clear benefits to developers. There are open questions related to the magnitude of the benefits and the alignment with the recent trend of AI-driven web development.
arXiv Detail & Related papers (2026-02-19T09:23:21Z)
Towards Scalable Topic Detection on Web via Simulating Levy Walks Nature of Topics in Similarity Space [55.97416108140739]
We present a novel yet very powerful Explore-Exploit (EE) approach to group topics by simulating the Levy-walk nature of topics in the similarity space. Experiments on two public data sets demonstrate that our approach is not only comparable to the state-of-the-art methods in terms of effectiveness but also significantly outperforms them in terms of efficiency.
arXiv Detail & Related papers (2024-07-26T07:19:46Z)
Health Misinformation Detection in Web Content via Web2Vec: A Structural-, Content-based, and Context-aware Approach based on Web2Vec [3.299010876315217]
We focus on Web page content, where there is still room for research to study structural-, content- and context-based features to assess the credibility of Web pages. This work aims to study the effectiveness of such features in association with a deep learning model, starting from an embedded representation of Web pages that has been recently proposed in the context of phishing Web page detection, i.e., Web2Vec.
arXiv Detail & Related papers (2024-07-05T10:33:15Z)
Bridging Social Media and Search Engines: Dredge Words and the Detection of Unreliable Domains [3.659498819753633]
We develop a website credibility classification and discovery system that integrates webgraph and social media contexts. We introduce the concept of dredge words, terms or phrases for which unreliable domains rank highly on search engines. We release a novel dataset of dredge words, highlighting their strong connections to both social media and online commerce platforms.
arXiv Detail & Related papers (2024-06-17T11:22:04Z)
Forgotten Knowledge: Examining the Citational Amnesia in NLP [63.13508571014673]
We ask how far back in time we tend to go when citing papers, how that has changed over time, and what factors correlate with this citational attention/amnesia. We show that around 62% of cited papers are from the immediate five years prior to publication, whereas only about 17% are more than ten years old. We show that the median age and age diversity of cited papers were steadily increasing from 1990 to 2014, but since then, the trend has reversed, and current NLP papers have an all-time low temporal citation diversity.
arXiv Detail & Related papers (2023-05-29T18:30:34Z)
Web 3.0: The Future of Internet [53.234101208024335]
Web 3.0 is a decentralized Web architecture that is more intelligent and safer than before. Web 3.0 is capable of addressing web data ownership according to distributed technology. It will optimize the internet world from the perspectives of economy, culture, and technology.
arXiv Detail & Related papers (2023-03-23T15:37:42Z)
Web3: The Next Internet Revolution [50.16560061003771]
Next internet revolution: Web3 is going to open new opportunities for traditional social models. Decentralized finance will be global, and open with financial inclusiveness for unbanked people. Several worthwhile future research directions of Web3 are discussed.
arXiv Detail & Related papers (2023-03-22T23:37:43Z)
Leveraging Google's Publisher-specific IDs to Detect Website Administration [3.936965297430477]
We propose a novel, graph-based methodology to detect administration of websites on the Web. We apply our methodology across the top 1 million websites and study the characteristics of the created graphs of website administration. Our findings show that approximately 90% of the websites are each associated with a single publisher, and that small publishers tend to manage less popular websites.
arXiv Detail & Related papers (2022-02-10T14:59:17Z)
Prediction of new outlinks for focused Web crawling [0.0]
This work provides a methodology for detecting new links effectively using a short history. We provide statistical models for three targets: the link change rate, the presence of new links, and the number of new links. A notable finding is that, if the history of the target page is not available, then our new features, which represent the history of related pages, are the most predictive of new links in the target page.
arXiv Detail & Related papers (2021-11-09T11:36:21Z)
The Rise and Fall of Fake News sites: A Traffic Analysis [62.51737815926007]
We investigate the online presence of fake news websites and characterize their behavior in comparison to real news websites. Based on our findings, we build a content-agnostic ML classifier for automatic detection of fake news websites.
arXiv Detail & Related papers (2021-03-16T18:10:22Z)
Echo Chambers on Social Media: A comparative analysis [64.2256216637683]
We introduce an operational definition of echo chambers and perform a massive comparative analysis on 1B pieces of content produced by 1M users on four social media platforms. We infer the leaning of users about controversial topics and reconstruct their interaction networks by analyzing different features. We find support for the hypothesis that platforms implementing news feed algorithms, like Facebook, may elicit the emergence of echo chambers.
arXiv Detail & Related papers (2020-04-20T20:00:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.