Tag-Pag: A Dedicated Tool for Systematic Web Page Annotations
- URL: http://arxiv.org/abs/2502.16150v1
- Date: Sat, 22 Feb 2025 08:52:01 GMT
- Title: Tag-Pag: A Dedicated Tool for Systematic Web Page Annotations
- Authors: Anton Pogrebnjak, Julian Schelb, Andreas Spitz, Celina Kacperski, Roberto Ulloa
- Abstract summary: Tag-Pag is an application designed to simplify the categorization of web pages. Unlike existing tools that focus on annotating sections of text, Tag-Pag systematizes page-level annotations.
- Score: 2.7961972519572447
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tag-Pag is an application designed to simplify the categorization of web pages, a task increasingly common for researchers who scrape web pages to analyze individuals' browsing patterns or train machine learning classifiers. Unlike existing tools that focus on annotating sections of text, Tag-Pag systematizes page-level annotations, allowing users to determine whether an entire document relates to one or multiple predefined topics. Tag-Pag offers an intuitive interface to configure the input web pages and annotation labels. It integrates libraries to extract content from the HTML and URL indicators to aid the annotation process. It provides direct access to both scraped and live versions of the web page. Our tool is designed to expedite the annotation process with features like quick navigation, label assignment, and export functionality, making it a versatile and efficient tool for various research applications. Tag-Pag is available at https://github.com/Pantonius/TagPag.
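The abstract describes a page-level workflow: extract readable text from scraped HTML, consult URL indicators, and assign one or more predefined topic labels to the whole page. A minimal sketch of that kind of workflow is shown below; the class and function names, the keyword-indicator heuristic, and the topic dictionary are all illustrative assumptions, not Tag-Pag's actual implementation (Tag-Pag itself supports manual annotation through its interface).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def suggest_labels(url, html, topics):
    """Suggest every topic whose keyword appears in the page text or the URL.

    `topics` maps a label to a single indicator keyword; a real tool would
    use richer indicators, but the page-level granularity is the point here.
    """
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    return [label for label, keyword in topics.items()
            if keyword in text or keyword in url.lower()]


# Example: one scraped page checked against two predefined topics.
topics = {"news": "breaking", "health": "vaccine"}
labels = suggest_labels(
    "https://example.com/health/article",
    "<html><body><p>New vaccine trial results.</p></body></html>",
    topics,
)
# labels == ["health"]
```

The key design point this mirrors is that the unit of annotation is the entire document, so the output is a list of labels per page rather than character-offset spans per text segment.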
Related papers
- Infogent: An Agent-Based Framework for Web Information Aggregation [59.67710556177564]
We introduce Infogent, a novel framework for web information aggregation.
Experiments on different information access settings demonstrate Infogent beats an existing SOTA multi-agent search framework by 7%.
arXiv Detail & Related papers (2024-10-24T18:01:28Z) - AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z) - EEVEE: An Easy Annotation Tool for Natural Language Processing [32.111061774093]
We propose EEVEE, an annotation tool focused on simplicity, efficiency, and ease of use.
It can run directly in the browser (no setup required) and uses tab-separated files (as opposed to character offsets or task-specific formats) for annotation.
arXiv Detail & Related papers (2024-02-05T10:24:40Z) - A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
arXiv Detail & Related papers (2023-05-05T16:38:05Z) - POTATO: The Portable Text Annotation Tool [8.924906491840119]
We present POTATO, a free, fully open-sourced annotation system.
It supports labeling many types of text and multimodal data.
It offers easy-to-configure features to maximize the productivity of both deployers and annotators.
arXiv Detail & Related papers (2022-12-16T17:57:41Z) - SciAnnotate: A Tool for Integrating Weak Labeling Sources for Sequence Labeling [55.71459234749639]
SciAnnotate is a web-based text annotation tool whose name stands for scientific annotation tool.
Our tool provides users with multiple user-friendly interfaces for creating weak labels.
In this study, we take multi-source weak label denoising as an example and utilize a Bertifying Conditional Hidden Markov Model to denoise the weak labels generated by our tool.
arXiv Detail & Related papers (2022-08-07T19:18:13Z) - WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z) - SenTag: a Web-based Tool for Semantic Annotation of Textual Documents [4.910379177401659]
SenTag is a web-based tool focused on semantic annotation of textual documents.
The main goal of the application is two-fold: facilitating the tagging process and reducing or avoiding errors in the output documents.
It is also possible to assess the level of agreement of annotators working on a corpus of text.
arXiv Detail & Related papers (2021-09-16T08:39:33Z) - PanGEA: The Panoramic Graph Environment Annotation Toolkit [83.12648898284048]
PanGEA is a toolkit for collecting speech and text annotations in photo-realistic 3D environments.
PanGEA immerses annotators in a web-based simulation and allows them to move around easily as they speak and/or listen.
arXiv Detail & Related papers (2021-03-23T17:24:12Z) - Boilerplate Removal using a Neural Sequence Labeling Model [4.056234173482691]
We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input.
This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model.
arXiv Detail & Related papers (2020-04-22T08:06:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.