Fugu-MT Paper Translation (Abstract): Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Paper overview: Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

arxiv url: http://arxiv.org/abs/2605.28787v1
Date: Wed, 27 May 2026 17:46:43 GMT
Status: Translation complete
System update date: 2026-05-28 17:38:56.256098
Title: Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval
Authors: Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy
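Abstract summary: The rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: is semantic metadata still necessary for agentic data discovery? The paper presents a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using schema.org. The Semantic Agent excels at retrieving actionable data, achieving 44.9% higher precision for metadata-rich registries and 46.6% higher precision for pages with machine-readable downloads among its returned results.
Reference score (independently computed attention score): 3.6202634353482352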
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using schema.org. We deploy an "LLM-as-a-judge" evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers "Last-Mile Utility" failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.


Related papers