Related papers: Boilerplate Removal using a Neural Sequence Labeling Model

URL: http://arxiv.org/abs/2004.14294v1
Date: Wed, 22 Apr 2020 08:06:59 GMT
Title: Boilerplate Removal using a Neural Sequence Labeling Model
Authors: Jurek Leonhardt, Avishek Anand, Megha Khosla
Abstract summary: We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model.
Score: 4.056234173482691
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing. Existing approaches are lacking as they rely on large amounts of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack in generalization power. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model.
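The abstract states that the model takes only the HTML tags and words of a page as input. A minimal sketch of what such an input sequence could look like is given below, using Python's standard html.parser module. The names LeafSequencer and page_to_sequence are hypothetical, and the representation (one tag path plus word list per text node) is an assumption made for illustration, not the authors' actual preprocessing.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class LeafSequencer(HTMLParser):
    """Collects one (tag_path, words) pair per non-empty text node.

    Hypothetical helper: the paper's real input encoding may differ.
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.leaves = []  # resulting sequence, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        words = data.split()
        if words:
            self.leaves.append((tuple(self.stack), words))


def page_to_sequence(html):
    """Turn an HTML string into a sequence of (tag_path, words) items."""
    parser = LeafSequencer()
    parser.feed(html)
    parser.close()
    return parser.leaves
```

A sequence labeling model would then assign each item in this sequence a content or boilerplate label. In this sketch a navigation link and a paragraph end up as separate items with different tag paths, which is the kind of signal such a model could learn from without hand-crafted features.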

Related papers

WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation [24.99791278208309]
We introduce Web Rendering Parameters Generation (WebRPG), a new task that aims at automating the generation for visual presentation of web pages based on their HTML code. We present baseline models, utilizing VAE to manage numerous elements and rendering parameters, along with custom HTML embedding for capturing essential semantic and hierarchical information from HTML.
arXiv Detail & Related papers (2024-07-22T09:35:43Z)
AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website. We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
From Categories to Classifiers: Name-Only Continual Learning by Exploring the Web [118.67589717634281]
Continual learning often relies on the availability of extensive annotated datasets, an assumption that is unrealistically time-consuming and costly in practice. We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation. Our proposed solution leverages the expansive and ever-evolving internet to query and download uncurated webly-supervised data for image classification.
arXiv Detail & Related papers (2023-11-19T10:43:43Z)
A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data. We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
arXiv Detail & Related papers (2023-05-05T16:38:05Z)
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities [54.26896306906937]
We present OVEN-Wiki, where a model needs to link an image to a Wikipedia entity with respect to a text query. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning. While PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.
arXiv Detail & Related papers (2023-02-22T05:31:26Z)
GROWN+UP: A Graph Representation Of a Webpage Network Utilizing Pre-training [0.2538209532048866]
We introduce an agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune to arbitrary tasks on webpages effectually. We show that our pre-trained model achieves state-of-the-art results using multiple datasets on two very different benchmarks: webpage boilerplate removal and genre classification.
arXiv Detail & Related papers (2022-08-03T13:37:27Z)
WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents [16.101638575566444]
FreeDOM's first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer range distance and semantic relatedness using a relational neural network.
arXiv Detail & Related papers (2020-10-21T04:20:13Z)
ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages [66.45377533562417]
We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template. Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
arXiv Detail & Related papers (2020-05-14T16:15:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.