Fugu-MT Paper Translation (Abstract): AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis

Paper overview: AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis

arxiv url: http://arxiv.org/abs/2604.02690v1
Date: Fri, 03 Apr 2026 03:34:19 GMT
Status: Translation complete
Last updated in system: 2026-04-06 17:20:24.305191
Title: AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis
Title (reference translation): AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis
Authors: Teng Lin, Yuyu Luo, Nan Tang
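Abstract summary: We propose AnnoRetrieve, a new retrieval paradigm that shifts from embeddings to structured annotations. The proposed system replaces expensive vector comparisons with lightweight structured queries over automatically induced schemas. AnnoRetrieve establishes a new paradigm for cost-effective, precise, and scalable document analysis through intelligent structuring.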
Reference score (independently computed attention score): 11.689256498133446
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unstructured documents dominate enterprise and web data, but their lack of explicit organization hinders precise information retrieval. Current mainstream retrieval methods, especially embedding-based vector search, rely on coarse-grained semantic similarity, incurring high computational cost and frequent LLM calls for post-processing. To address this critical issue, we propose AnnoRetrieve, a novel retrieval paradigm that shifts from embeddings to structured annotations, enabling precise, annotation-driven semantic retrieval. Our system replaces expensive vector comparisons with lightweight structured queries over automatically induced schemas, dramatically reducing LLM usage and overall cost. The system integrates two synergistic core innovations: SchemaBoot, which automatically generates document annotation schemas via multi-granularity pattern discovery and constraint-based optimization, laying a foundation for annotation-driven retrieval and eliminating manual schema design, and Structured Semantic Retrieval (SSR), the core retrieval engine, which unifies semantic understanding with structured query execution; by leveraging the annotated structure instead of vector embeddings, SSR achieves precise semantic matching, seamlessly completing attribute-value extraction, table generation, and progressive SQL-based reasoning without relying on LLM interventions. This annotation-driven paradigm overcomes the limitations of traditional vector-based methods with coarse-grained matching and heavy LLM dependency and graph-based methods with high computational overhead. Experiments on three real-world datasets confirm that AnnoRetrieve significantly lowers LLM call frequency and retrieval cost while maintaining high accuracy. AnnoRetrieve establishes a new paradigm for cost-effective, precise, and scalable document analysis through intelligent structuring.
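Abstract (reference translation): Unstructured documents dominate enterprise and web data, but their lack of explicit organization makes precise information retrieval difficult. Current mainstream retrieval methods, especially embedding-based vector search, rely on coarse-grained semantic similarity, which incurs high computational cost and frequent LLM calls for post-processing. To address this critical issue, we propose AnnoRetrieve, a new retrieval paradigm that shifts from embeddings to structured annotations and enables precise, annotation-driven semantic retrieval. The proposed system replaces expensive vector comparisons with lightweight structured queries over automatically induced schemas, dramatically reducing LLM usage and overall cost. The system integrates two synergistic core innovations. SchemaBoot automatically generates document annotation schemas through multi-granularity pattern discovery and constraint-based optimization, laying the foundation for annotation-driven retrieval and eliminating manual schema design. Structured Semantic Retrieval (SSR), the core retrieval engine, unifies semantic understanding with structured query execution; by leveraging the annotated structure instead of vector embeddings, SSR achieves precise semantic matching and seamlessly completes attribute-value extraction, table generation, and progressive SQL-based reasoning without relying on LLM intervention. This annotation-driven paradigm overcomes the limitations of traditional vector-based methods, which suffer from coarse-grained matching and heavy LLM dependency, and of graph-based methods, which carry high computational overhead. Experiments on three real-world datasets confirm that AnnoRetrieve significantly lowers LLM call frequency and retrieval cost while maintaining high accuracy. AnnoRetrieve establishes a new paradigm for cost-effective, precise, and scalable document analysis through intelligent structuring.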
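The abstract's central idea, answering a question with a lightweight structured query over annotations rather than a vector comparison, can be illustrated with a minimal sketch. This is not the authors' implementation: the `annotation` table layout, the attribute names (`company`, `revenue_musd`, `year`), and the sample records are all hypothetical, and the annotations are written by hand here, whereas the paper describes a component (SchemaBoot) that induces the schema automatically.

```python
import sqlite3

# Hypothetical attribute-value annotations for three documents.
# In the paper these would be produced under an automatically induced
# schema; here they are hand-written purely for illustration.
ANNOTATIONS = [
    ("doc1", "company", "Acme Corp"),
    ("doc1", "revenue_musd", "120"),
    ("doc1", "year", "2023"),
    ("doc2", "company", "Globex"),
    ("doc2", "revenue_musd", "85"),
    ("doc2", "year", "2023"),
    ("doc3", "company", "Initech"),
    ("doc3", "revenue_musd", "40"),
    ("doc3", "year", "2022"),
]


def build_store(annotations):
    """Load (doc_id, attribute, value) triples into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation (doc_id TEXT, attribute TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", annotations)
    conn.commit()
    return conn


def companies_with_revenue_over(conn, year, min_revenue):
    """Return (company, revenue) pairs for one year, filtered by revenue.

    The answer comes from a plain SQL self-join over the annotations,
    with no embedding lookup and no LLM call at query time.
    """
    query = """
        SELECT c.value AS company, CAST(r.value AS REAL) AS revenue
        FROM annotation AS c
        JOIN annotation AS r
          ON r.doc_id = c.doc_id AND r.attribute = 'revenue_musd'
        JOIN annotation AS y
          ON y.doc_id = c.doc_id AND y.attribute = 'year'
        WHERE c.attribute = 'company'
          AND y.value = ?
          AND CAST(r.value AS REAL) > ?
        ORDER BY revenue DESC
    """
    return conn.execute(query, (str(year), min_revenue)).fetchall()


conn = build_store(ANNOTATIONS)
rows = companies_with_revenue_over(conn, 2023, 100)
print(rows)
```

The sketch stores annotations as generic triples so that new attributes need no table migration; a system with a fixed induced schema could equally use one column per attribute, which would make the query a simple `WHERE` filter.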


Related papers