A Large Visual, Qualitative and Quantitative Dataset of Web Pages
- URL: http://arxiv.org/abs/2105.07113v1
- Date: Sat, 15 May 2021 01:31:25 GMT
- Title: A Large Visual, Qualitative and Quantitative Dataset of Web Pages
- Authors: Christian Mejia-Escobar, Miguel Cazorla, Ester Martinez-Martin
- Abstract summary: We have created a large dataset of 49,438 Web pages.
It consists of visual, textual and numerical data types, includes all countries worldwide, and considers a broad range of topics.
- Score: 4.5002924206836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The World Wide Web is not only one of the most important platforms of
communication and information at present, but also an area of growing interest
for scientific research. This motivates a lot of work and projects that require
large amounts of data. However, no existing dataset integrates the
parameters and visual appearance of Web pages, because collecting such data is
costly in terms of time and effort. With the support of various computer
tools and programming scripts, we have created a large dataset of 49,438 Web
pages. It consists of visual, textual and numerical data types, includes all
countries worldwide, and considers a broad range of topics such as art,
entertainment, economy, business, education, government, news, media, science,
and environment, covering different cultural characteristics and varied design
preferences. In this paper, we describe the process of collecting, debugging
and publishing the final product, which is freely available. To demonstrate the
usefulness of our dataset, we present a binary classification model for
detecting error Web pages and a multi-class, subject-based categorization of
Web pages, both implemented using convolutional neural networks.
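As an illustration of the binary error-page task, the sketch below shows how records combining the dataset's three data types (visual, textual, numerical) might be labeled as error vs. normal pages. This is a minimal, hypothetical sketch: the field names and the HTTP-status labeling rule are assumptions for illustration, not the authors' actual schema or method.

```python
# Hypothetical record layout for a dataset combining visual, textual and
# numerical data about a Web page; field names are illustrative only.
from dataclasses import dataclass


@dataclass
class WebPageRecord:
    url: str
    screenshot_path: str  # visual data: path to the rendered screenshot
    text: str             # textual data: extracted page text
    num_links: int        # example numerical parameter of the page
    http_status: int      # HTTP response code recorded at crawl time


def binary_label(record: WebPageRecord) -> int:
    """Assumed labeling rule: 1 = error page (HTTP 4xx/5xx), 0 = normal."""
    return 1 if record.http_status >= 400 else 0


rec = WebPageRecord("http://example.com/missing", "shots/0001.png",
                    "404 Not Found", 3, 404)
print(binary_label(rec))  # 1
```

Labels derived this way could then supervise a convolutional network trained on the screenshots alone, which is the setting the binary classification experiment describes.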
Related papers
- Unlocking Comics: The AI4VA Dataset for Visual Understanding [62.345344799258804]
This paper presents a novel dataset comprising Franco-Belgian comics from the 1950s annotated for tasks including depth estimation, semantic segmentation, saliency detection, and character identification.
It consists of two distinct and consistent styles and incorporates object concepts and labels taken from natural images.
By including such diverse information across styles, this dataset not only holds promise for computational creativity but also offers avenues for the digitization of art and storytelling innovation.
arXiv Detail & Related papers (2024-10-27T14:27:05Z)
- Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research [0.0]
Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics.
Access to these datasets is often restricted due to costs and platform regulations.
This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms.
arXiv Detail & Related papers (2024-07-11T09:12:39Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- Creating Knowledge Graphs for Geographic Data on the Web [6.654753562389985]
Geographic data plays an essential role in various Web, Semantic Web and machine learning applications.
This article describes recent approaches we developed to tackle these challenges.
arXiv Detail & Related papers (2023-02-17T11:44:49Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Scene-Aware Ambidextrous Bin Picking via Physics-based Metaverse Synthesis [72.85526892440251]
We introduce MetaGraspNet, a large-scale photo-realistic bin picking dataset constructed via physics-based metaverse synthesis.
The proposed dataset contains 217k RGBD images across 82 different article types, with full annotations for object detection, amodal perception, keypoint detection, manipulation order and ambidextrous grasp labels for a parallel-jaw and vacuum gripper.
We also provide a real dataset consisting of over 2.3k fully annotated high-quality RGBD images, divided into 5 levels of difficulties and an unseen object set to evaluate different object and layout properties.
arXiv Detail & Related papers (2022-08-08T08:15:34Z)
- GROWN+UP: A Graph Representation Of a Webpage Network Utilizing Pre-training [0.2538209532048866]
We introduce an agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune effectively to arbitrary tasks on webpages.
We show that our pre-trained model achieves state-of-the-art results using multiple datasets on two very different benchmarks: webpage boilerplate removal and genre classification.
arXiv Detail & Related papers (2022-08-03T13:37:27Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z)
- The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
First, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z)
- Multimodal datasets: misogyny, pornography, and malignant stereotypes [2.8682942808330703]
We examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset.
We found that the dataset contains troublesome and explicit image-text pairs depicting rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.
arXiv Detail & Related papers (2021-10-05T11:47:27Z)
- A Web Scale Entity Extraction System [9.300916856534007]
We present learnings from our efforts in building an entity extraction system for multiple document types at large scale.
We empirically demonstrate the effectiveness of multi-lingual, multi-task and cross-document type learning.
We also discuss the label collection schemes that help to minimize the amount of noise in the collected data.
arXiv Detail & Related papers (2021-08-27T16:37:37Z)
- REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to support the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.