Related papers: The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models

URL: http://arxiv.org/abs/2111.02168v4
Date: Fri, 23 Feb 2024 19:22:23 GMT
Title: The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models
Authors: Alexandra Hotti, Riccardo Sven Risuleo, Stefan Magureanu, Aref Moradi, Jens Lagergren
Abstract summary: We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety. First, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task. Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page. Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
Score: 51.39011092347136
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Web automation holds the potential to revolutionize how users interact with the digital world, offering unparalleled assistance and simplifying tasks via sophisticated computational methods. Central to this evolution is the web element nomination task, which entails identifying unique elements on webpages. Unfortunately, the development of algorithmic designs for web automation is hampered by the scarcity of comprehensive and realistic datasets that reflect the complexity faced by real-world applications on the Web. To address this, we introduce the Klarna Product Page Dataset, a comprehensive and diverse collection of webpages that surpasses existing datasets in richness and variety. The dataset features 51,701 manually labeled product pages from 8,175 e-commerce websites across eight geographic regions, accompanied by a dataset of rendered page screenshots. To initiate research on the Klarna Product Page Dataset, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task. We make three important contributions. First, we found that a simple Convolutional GNN (GCN) outperforms complex state-of-the-art nomination methods. Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page using the aforementioned GCN. These elements are then passed to a large language model for the final nomination. This procedure significantly improves the nomination accuracy by 16.8 percentage points on our challenging dataset, without any need for fine-tuning. Finally, in response to another prevalent challenge in this field - the abundance of training methodologies suitable for element nomination - we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
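The two-stage procedure described in the abstract (a GCN shortlists a few candidate elements, then a large language model makes the final nomination) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the element strings, the scores, the value of k, and the keyword-based `keyword_picker` that stands in for the language model are all hypothetical.

```python
from typing import Callable, List, Sequence


def top_k_candidates(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring elements, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def nominate(
    elements: Sequence[str],
    scores: Sequence[float],
    k: int,
    pick: Callable[[List[str]], int],
) -> int:
    """Shortlist k elements by score, then let `pick` choose one of them.

    `scores` plays the role of the first-stage model output (assumed here),
    and `pick` plays the role of the second-stage language model. It receives
    the shortlisted elements and returns a position within that shortlist.
    """
    shortlist = top_k_candidates(scores, k)
    choice = pick([elements[i] for i in shortlist])
    return shortlist[choice]


def keyword_picker(candidates: List[str]) -> int:
    """Hypothetical stand-in for the LLM: prefer a candidate mentioning 'cart'."""
    for position, text in enumerate(candidates):
        if "cart" in text.lower():
            return position
    return 0  # fall back to the top-scored candidate


# Made-up page elements and first-stage scores, for illustration only.
elements = [
    "<a>Home</a>",
    "<button>Add to cart</button>",
    "<span>$19.99</span>",
    "<button>Subscribe</button>",
]
scores = [0.1, 0.7, 0.2, 0.8]

nominated = nominate(elements, scores, k=2, pick=keyword_picker)
print(elements[nominated])
```

In this toy run the highest-scored element is the wrong button, and the second stage corrects the choice from within the shortlist, which is the kind of refinement the abstract attributes to the language model step.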

Related papers

WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces [5.150606279179606]
WebChain is the largest open-source dataset of human-annotated trajectories on real-world websites. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.
arXiv Detail & Related papers (2026-03-05T15:37:34Z)
Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation [54.945281159783896]
We present a scalable pipeline for automatically generating high-quality training data for web agents. We introduce a novel constraint-based evaluation framework that provides fine-grained assessment of progress towards task completion.
arXiv Detail & Related papers (2026-02-13T02:52:18Z)
WebDS: An End-to-End Benchmark for Web-based Data Science [59.270670758607494]
WebDS is the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites. WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.
arXiv Detail & Related papers (2025-08-02T06:39:59Z)
Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction [83.0216122783429]
Web Reconstruction (WebR) is a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents. We show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks.
arXiv Detail & Related papers (2025-04-22T04:07:13Z)
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z)
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data. We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
From Categories to Classifiers: Name-Only Continual Learning by Exploring the Web [118.67589717634281]
Continual learning often relies on the availability of extensive annotated datasets, an assumption that is unrealistically time-consuming and costly in practice. We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation. Our proposed solution leverages the expansive and ever-evolving internet to query and download uncurated webly-supervised data for image classification.
arXiv Detail & Related papers (2023-11-19T10:43:43Z)
AutoSynth: Learning to Generate 3D Training Data for Object Point Cloud Registration [69.21282992341007]
AutoSynth automatically generates 3D training data for point cloud registration. We replace the point cloud registration network with a much smaller surrogate network, leading to a $4056.43\times$ speedup. Our results on TUD-L, LINEMOD and Occluded-LINEMOD evidence that a neural network trained on our searched dataset yields consistently better performance than the same one trained on the widely used ModelNet40 dataset.
arXiv Detail & Related papers (2023-09-20T09:29:44Z)
PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network [19.75890828376791]
We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN. It is based on the joint encoding of text and HTML DOM trees in the web pages. It performs well on the KI-04 and SWDE datasets and on the practical dataset AHS, built for a scholar homepage crawling project.
arXiv Detail & Related papers (2023-05-09T12:19:10Z)
GROWN+UP: A Graph Representation Of a Webpage Network Utilizing Pre-training [0.2538209532048866]
We introduce an agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune to arbitrary tasks on webpages effectually. We show that our pre-trained model achieves state-of-the-art results using multiple datasets on two very different benchmarks: webpage boilerplate removal and genre classification.
arXiv Detail & Related papers (2022-08-03T13:37:27Z)
Incremental Learning Meets Transfer Learning: Application to Multi-site Prostate MRI Segmentation [16.50535949349874]
We propose a novel multi-site segmentation framework called incremental-transfer learning (ITL). ITL learns a model from multi-site datasets in an end-to-end sequential fashion. We show for the first time that leveraging our ITL training scheme is able to alleviate challenging catastrophic forgetting problems in incremental learning.
arXiv Detail & Related papers (2022-06-03T02:32:01Z)
Inducing Gaussian Process Networks [80.40892394020797]
We propose inducing Gaussian process networks (IGN), a simple framework for simultaneously learning the feature space as well as the inducing points. The inducing points, in particular, are learned directly in the feature space, enabling a seamless representation of complex structured domains. We report on experimental results for real-world data sets showing that IGNs provide significant advances over state-of-the-art methods.
arXiv Detail & Related papers (2022-04-21T05:27:09Z)
Meta Propagation Networks for Graph Few-shot Semi-supervised Learning [39.96930762034581]
We propose a novel network architecture equipped with a novel meta-learning algorithm to solve this problem. In essence, our framework Meta-PN infers high-quality pseudo labels on unlabeled nodes via a meta-learned label propagation strategy. Our approach offers easy and substantial performance gains compared to existing techniques on various benchmark datasets.
arXiv Detail & Related papers (2021-12-18T00:11:56Z)
Exploiting Shared Representations for Personalized Federated Learning [54.65133770989836]
We propose a novel federated learning framework and algorithm for learning a shared data representation across clients and unique local heads for each client. Our algorithm harnesses the distributed computational power across clients to perform many local-updates with respect to the low-dimensional local parameters for every update of the representation. This result is of interest beyond federated learning to a broad class of problems in which we aim to learn a shared low-dimensional representation among data distributions.
arXiv Detail & Related papers (2021-02-14T05:36:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.