Development of an Automated Web Application for Efficient Web Scraping: Design and Implementation
- URL: http://arxiv.org/abs/2510.21831v1
- Date: Wed, 22 Oct 2025 04:56:00 GMT
- Title: Development of an Automated Web Application for Efficient Web Scraping: Design and Implementation
- Authors: Alok Dutta, Nilanjana Roy, Rhythm Sen, Sougata Dutta, Prabhat Das,
- Abstract summary: This paper presents the design and implementation of a user-friendly, automated web application that simplifies and optimize the web scraping process for non-technical users.<n>The application breaks down the complex task of web scraping into three main stages: fetching, extraction, and execution.<n>This automated tool not only enhances the efficiency of web scraping but also democratizes access to data extraction by empowering users of all technical levels to gather and manage data tailored to their needs.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This paper presents the design and implementation of a user-friendly, automated web application that simplifies and optimizes the web scraping process for non-technical users. The application breaks down the complex task of web scraping into three main stages: fetching, extraction, and execution. In the fetching stage, the application accesses target websites using the HTTP protocol, leveraging the requests library to retrieve HTML content. The extraction stage utilizes powerful parsing libraries like BeautifulSoup and regular expressions to extract relevant data from the HTML. Finally, the execution stage structures the data into accessible formats, such as CSV, ensuring the scraped content is organized for easy use. To provide personalized and secure experiences, the application includes user registration and login functionalities, supported by MongoDB, which stores user data and scraping history. Deployed using the Flask framework, the tool offers a scalable, robust environment for web scraping. Users can easily input website URLs, define data extraction parameters, and download the data in a simplified format, without needing technical expertise. This automated tool not only enhances the efficiency of web scraping but also democratizes access to data extraction by empowering users of all technical levels to gather and manage data tailored to their needs. The methodology detailed in this paper represents a significant advancement in making web scraping tools accessible, efficient, and easy to use for a broader audience.
Related papers
- Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining [78.36592534300839]
We show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance.<n>This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71%.
arXiv Detail & Related papers (2026-02-23T06:41:57Z) - Beyond BeautifulSoup: Benchmarking LLM-Powered Web Scraping for Everyday Users [5.7578515237305625]
Large language models (LLMs) have democratized web scraping, enabling low-skill users to execute sophisticated operations through simple natural language prompts.<n>We show that without extensive manual effort, current LLM-based benchmarks allow novice users to scrape websites that would otherwise be inaccessible.
arXiv Detail & Related papers (2026-01-09T20:34:28Z) - Nested Browser-Use Learning for Agentic Information Seeking [60.775556172513014]
Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching.<n>We propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure.
arXiv Detail & Related papers (2025-12-29T17:59:14Z) - SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning [48.376164461507244]
We introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework.<n>Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages.<n> Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o.
arXiv Detail & Related papers (2025-10-02T09:27:15Z) - WebDS: An End-to-End Benchmark for Web-based Data Science [59.270670758607494]
WebDS is the first end-to-end web-based data science benchmark.<n>It comprises 870 web-based data science tasks across 29 diverse websites.<n>WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.
arXiv Detail & Related papers (2025-08-02T06:39:59Z) - Multi-Record Web Page Information Extraction From News Websites [83.88591755871734]
In this paper, we focus on the problem of extracting information from web pages containing many records.<n>To address this gap, we created a large-scale, open-access dataset specifically designed for list pages.<n>Our dataset contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity.
arXiv Detail & Related papers (2025-02-20T15:05:00Z) - An innovative data collection method to eliminate the preprocessing phase in web usage mining [0.0]
The underlying data source for web usage mining (WUM) is commonly thought to be server logs.<n>This study proposes an innovative method for user tracking, session management, and collecting web usage data.<n>An application-based API has been developed with a different strategy from conventional client-side methods to obtain and process log data.
arXiv Detail & Related papers (2025-01-08T09:03:16Z) - AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing methods, wrappers-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z) - Cleaner Pretraining Corpus Curation with Neural Web Scraping [39.97459187762505]
This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) to help extract primary and clean text contents from webpages.
Experimental results show that NeuScraper surpasses the baseline scrapers by achieving more than a 20% improvement.
arXiv Detail & Related papers (2024-02-22T16:04:03Z) - WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z) - The Klarna Product Page Dataset: Web Element Nomination with Graph
Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
We empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.