Fugu-MT Paper Translation (Abstract): LiveWeb-IE: A Benchmark For Online Web Information Extraction

Paper overview: LiveWeb-IE: A Benchmark For Online Web Information Extraction

arxiv url: http://arxiv.org/abs/2603.13773v1
Date: Sat, 14 Mar 2026 05:55:11 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.397439
Title: LiveWeb-IE: A Benchmark For Online Web Information Extraction
Authors: Seungbin Yang, Jihwan Kim, Jaemin Choi, Dongjin Kim, Soyoung Yang, ChaeHun Park, Jaegul Choo
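Abstract summary: Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications. The paper introduces LiveWeb-IE, a new benchmark for evaluating WIE systems directly against live websites. It also proposes Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract the desired information.
Reference score (attention score computed independently by this site): 48.82654261583883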
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications. The evaluation of WIE systems has traditionally relied on benchmarks built from HTML snapshots captured at a single point in time. However, this offline evaluation paradigm fails to account for the temporally evolving nature of the web; consequently, performance on these static benchmarks often fails to generalize to dynamic real-world scenarios. To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites. Based on trusted and permission-granted websites, we curate natural language queries that require information extraction of various data categories, such as text, images, and hyperlinks. We further design these queries to represent four levels of complexity, based on the number and cardinality of attributes to be extracted, enabling a granular assessment of WIE systems. In addition, we propose Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract desired information. Extensive experiments across diverse backbone models demonstrate the effectiveness and robustness of VGS. We believe that this study lays the foundation for developing practical and robust WIE systems.
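Illustrative note: the abstract says queries span four complexity levels determined by the number and cardinality of the attributes to extract, but it does not define the levels. The sketch below shows one plausible reading, a 2x2 grid of attribute count against value cardinality. The `Attribute` schema, the `complexity_bucket` function, the bucket labels, and the sample queries are all hypothetical and are not taken from the paper or the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """One field requested by a query (hypothetical schema, not from the paper)."""
    name: str            # e.g. "headline", "image_url"
    multi_valued: bool   # True if the page may hold several values for this field


def complexity_bucket(attributes: list[Attribute]) -> tuple[str, str]:
    """Place a query on an assumed 2x2 grid: attribute count x value cardinality."""
    if not attributes:
        raise ValueError("a query must request at least one attribute")
    count = "single-attribute" if len(attributes) == 1 else "multi-attribute"
    any_multi = any(a.multi_valued for a in attributes)
    cardinality = "multi-valued" if any_multi else "single-valued"
    return count, cardinality


# Made-up example queries, one per cell of the assumed grid.
queries = {
    "headline of the article": [Attribute("headline", False)],
    "all image URLs on the page": [Attribute("image_url", True)],
    "name and price of the product": [
        Attribute("name", False),
        Attribute("price", False),
    ],
    "title and link of every listed paper": [
        Attribute("title", True),
        Attribute("link", True),
    ],
}

buckets = {query: complexity_bucket(attrs) for query, attrs in queries.items()}
for query, bucket in buckets.items():
    print(f"{query!r}: {bucket}")
```

The function returns labels rather than a numbered level because the abstract does not say how the four levels are ordered.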


Related papers