The Klarna Product Page Dataset: Web Element Nomination with Graph
Neural Networks and Large Language Models
- URL: http://arxiv.org/abs/2111.02168v4
- Date: Fri, 23 Feb 2024 19:22:23 GMT
- Title: The Klarna Product Page Dataset: Web Element Nomination with Graph
Neural Networks and Large Language Models
- Authors: Alexandra Hotti, Riccardo Sven Risuleo, Stefan Magureanu, Aref Moradi,
Jens Lagergren
- Abstract summary: We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
We empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
- Score: 51.39011092347136
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Web automation holds the potential to revolutionize how users interact with
the digital world, offering unparalleled assistance and simplifying tasks via
sophisticated computational methods. Central to this evolution is the web
element nomination task, which entails identifying unique elements on webpages.
Unfortunately, the development of algorithmic designs for web automation is
hampered by the scarcity of comprehensive and realistic datasets that reflect
the complexity faced by real-world applications on the Web. To address this, we
introduce the Klarna Product Page Dataset, a comprehensive and diverse
collection of webpages that surpasses existing datasets in richness and
variety. The dataset features 51,701 manually labeled product pages from 8,175
e-commerce websites across eight geographic regions, accompanied by a dataset
of rendered page screenshots. To initiate research on the Klarna Product Page
Dataset, we empirically benchmark a range of Graph Neural Networks (GNNs) on
the web element nomination task. We make three important contributions. First,
we found that a simple Convolutional GNN (GCN) outperforms complex
state-of-the-art nomination methods. Second, we introduce a training refinement
procedure that involves identifying a small number of relevant elements from
each page using the aforementioned GCN. These elements are then passed to a
large language model for the final nomination. This procedure significantly
improves the nomination accuracy by 16.8 percentage points on our challenging
dataset, without any need for fine-tuning. Finally, in response to another
prevalent challenge in this field - the abundance of training methodologies
suitable for element nomination - we introduce the Challenge Nomination
Training Procedure, a novel training approach that further boosts nomination
accuracy.
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - From Categories to Classifiers: Name-Only Continual Learning by Exploring the Web [118.67589717634281]
Continual learning often relies on the availability of extensive annotated datasets, an assumption that is unrealistically time-consuming and costly in practice.
We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation.
Our proposed solution leverages the expansive and ever-evolving internet to query and download uncurated webly-supervised data for image classification.
arXiv Detail & Related papers (2023-11-19T10:43:43Z) - AutoSynth: Learning to Generate 3D Training Data for Object Point Cloud
Registration [69.21282992341007]
Auto Synth automatically generates 3D training data for point cloud registration.
We replace the point cloud registration network with a much smaller surrogate network, leading to a $4056.43$ speedup.
Our results on TUD-L, LINEMOD and Occluded-LINEMOD evidence that a neural network trained on our searched dataset yields consistently better performance than the same one trained on the widely used ModelNet40 dataset.
arXiv Detail & Related papers (2023-09-20T09:29:44Z) - PLM-GNN: A Webpage Classification Method based on Joint Pre-trained
Language Model and Graph Neural Network [19.75890828376791]
We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN.
It is based on the joint encoding of text and HTML DOM trees in the web pages. It performs well on the KI-04 and SWDE datasets and on practical dataset AHS for the project of scholar's homepage crawling.
arXiv Detail & Related papers (2023-05-09T12:19:10Z) - GROWN+UP: A Graph Representation Of a Webpage Network Utilizing
Pre-training [0.2538209532048866]
We introduce an agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune to arbitrary tasks on webpages effectually.
We show that our pre-trained model achieves state-of-the-art results using multiple datasets on two very different benchmarks: webpage boilerplate removal and genre classification.
arXiv Detail & Related papers (2022-08-03T13:37:27Z) - Incremental Learning Meets Transfer Learning: Application to Multi-site
Prostate MRI Segmentation [16.50535949349874]
We propose a novel multi-site segmentation framework called incremental-transfer learning (ITL)
ITL learns a model from multi-site datasets in an end-to-end sequential fashion.
We show for the first time that leveraging our ITL training scheme is able to alleviate challenging catastrophic problems in incremental learning.
arXiv Detail & Related papers (2022-06-03T02:32:01Z) - Inducing Gaussian Process Networks [80.40892394020797]
We propose inducing Gaussian process networks (IGN), a simple framework for simultaneously learning the feature space as well as the inducing points.
The inducing points, in particular, are learned directly in the feature space, enabling a seamless representation of complex structured domains.
We report on experimental results for real-world data sets showing that IGNs provide significant advances over state-of-the-art methods.
arXiv Detail & Related papers (2022-04-21T05:27:09Z) - Meta Propagation Networks for Graph Few-shot Semi-supervised Learning [39.96930762034581]
We propose a novel network architecture equipped with a novel meta-learning algorithm to solve this problem.
In essence, our framework Meta-PN infers high-quality pseudo labels on unlabeled nodes via a meta-learned label propagation strategy.
Our approach offers easy and substantial performance gains compared to existing techniques on various benchmark datasets.
arXiv Detail & Related papers (2021-12-18T00:11:56Z) - Exploiting Shared Representations for Personalized Federated Learning [54.65133770989836]
We propose a novel federated learning framework and algorithm for learning a shared data representation across clients and unique local heads for each client.
Our algorithm harnesses the distributed computational power across clients to perform many local-updates with respect to the low-dimensional local parameters for every update of the representation.
This result is of interest beyond federated learning to a broad class of problems in which we aim to learn a shared low-dimensional representation among data distributions.
arXiv Detail & Related papers (2021-02-14T05:36:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.