ClueWeb22: 10 Billion Web Documents with Rich Information
- URL: http://arxiv.org/abs/2211.15848v1
- Date: Tue, 29 Nov 2022 00:49:40 GMT
- Title: ClueWeb22: 10 Billion Web Documents with Rich Information
- Authors: Arnold Overwijk, Chenyan Xiong, Xiao Liu, Cameron VandenBerg, Jamie Callan
- Abstract summary: ClueWeb22 provides 10 billion web pages affiliated with rich information.
Its design was influenced by the need for a high quality, large scale web corpus to support academic and industry research.
- Score: 28.68403988636645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10
billion web pages affiliated with rich information. Its design was influenced
by the need for a high quality, large scale web corpus to support a range of
academic and industry research, for example, in information systems,
retrieval-augmented AI systems, and model pretraining. Compared with earlier
ClueWeb corpora, the ClueWeb22 corpus is larger, more varied, of higher
quality, and aligned with the document distributions in commercial web
search. Besides raw HTML, ClueWeb22 includes rich information about the web
pages provided by industry-standard document understanding systems, including
the visual representation of pages rendered by a web browser, parsed HTML
structure information from a neural network parser, and pre-processed cleaned
document text to lower the barrier to entry. Many of these signals have been
widely used in industry but are available to the research community for the
first time at this scale.
Related papers
- HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems [62.36019283532854]
Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities.
HtmlRAG uses HTML instead of plain text as the format of retrieved knowledge in RAG.
We propose HTML cleaning, compression, and pruning strategies to shorten the HTML while minimizing the loss of information.
arXiv Detail & Related papers (2024-11-05T09:58:36Z)
- Health Misinformation Detection in Web Content via Web2Vec: A Structural-, Content-based, and Context-aware Approach based on Web2Vec [3.299010876315217]
We focus on Web page content, where there is still room for research to study structural-, content- and context-based features to assess the credibility of Web pages.
This work aims to study the effectiveness of such features in association with a deep learning model, starting from an embedded representation of Web pages that has been recently proposed in the context of phishing Web page detection, i.e., Web2Vec.
arXiv Detail & Related papers (2024-07-05T10:33:15Z)
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
- MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels [95.48844474720798]
We introduce MS MARCO Web Search, the first large-scale information-rich web dataset.
This dataset mimics real-world web document and query distribution.
MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks.
arXiv Detail & Related papers (2024-05-13T07:46:44Z)
- Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding [22.00873805952277]
WebLM is a multimodal pre-training network designed to address the limitations of solely modeling text and structure modalities of HTML in webpages.
We propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively.
Empirical results demonstrate that the pre-trained WebLM significantly surpasses previous state-of-the-art pre-trained models across several webpage understanding tasks.
arXiv Detail & Related papers (2024-02-28T11:50:36Z)
- Cleaner Pretraining Corpus Curation with Neural Web Scraping [39.97459187762505]
This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) to help extract primary and clean text contents from webpages.
Experimental results show that NeuScraper surpasses the baseline scrapers by achieving more than a 20% improvement.
arXiv Detail & Related papers (2024-02-22T16:04:03Z)
- PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network [19.75890828376791]
We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN.
It is based on the joint encoding of text and HTML DOM trees in web pages. It performs well on the KI-04 and SWDE datasets and on AHS, a practical dataset from a scholar homepage crawling project.
arXiv Detail & Related papers (2023-05-09T12:19:10Z)
- Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An Approach [115.91099791629104]
We construct two new webly supervised fine-grained benchmark datasets, WebFG-496 and WebiNat-5089.
WebiNat-5089 contains 5089 sub-categories and more than 1.1 million web training images, making it the largest webly supervised fine-grained dataset to date.
As a minor contribution, we also propose a novel webly supervised method (termed "Peer-learning") for benchmarking these datasets.
arXiv Detail & Related papers (2021-08-05T06:28:32Z)
- A Large Visual, Qualitative and Quantitative Dataset of Web Pages [4.5002924206836]
We have created a large dataset of 49,438 Web pages.
It consists of visual, textual and numerical data types, includes all countries worldwide, and considers a broad range of topics.
arXiv Detail & Related papers (2021-05-15T01:31:25Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.