Fugu-MT Paper Translation (Abstract): Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application

Paper abstract: Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application

arxiv url: http://arxiv.org/abs/2105.09297v1
Date: Fri, 14 May 2021 06:26:22 GMT
Status: Translation complete
Last updated in system: 2021-05-20 18:30:42.598218
Title: Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application
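Title (reference translation): Extracting a Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application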
Authors: Rongyu Cao and Yixuan Cao and Ganbin Zhou and Ping Luo
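Abstract summary: We develop a framework called Hierarchy Extraction from Long Document (HELD), which "sequentially" inserts each physical object at the proper position in the current tree. Experiments are based on thousands of long documents from the Chinese and English financial markets and from English scientific publications. We show how the logical document hierarchy can be used to improve the performance of the downstream passage retrieval task.
Reference score (attention measure computed by this site): 21.270184491603864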
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we study the problem of extracting variable-depth "logical document hierarchy" from long documents, namely organizing the recognized "physical document objects" into hierarchical structures. The discovery of logical document hierarchy is the vital step to support many downstream applications. However, long documents, containing hundreds or even thousands of pages and variable-depth hierarchy, challenge the existing methods. To address these challenges, we develop a framework, namely Hierarchy Extraction from Long Document (HELD), where we "sequentially" insert each physical object at the proper position of the current tree. Determining whether each possible position is proper or not can be formulated as a binary classification problem. To further improve its effectiveness and efficiency, we study the design variants in HELD, including traversal orders of the insertion positions, heading extraction explicitly or implicitly, tolerance to insertion errors in predecessor steps, and so on. The empirical experiments based on thousands of long documents from Chinese, English financial market and English scientific publication show that the HELD model with the "root-to-leaf" traversal order and explicit heading extraction is the best choice to achieve the tradeoff between effectiveness and efficiency with the accuracy of 0.9726, 0.7291 and 0.9578 in Chinese financial, English financial and arXiv datasets, respectively. Finally, we show that logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task. In summary, we conduct a systematic study on this task in terms of methods, evaluations, and applications.
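Abstract (reference translation): This paper studies the problem of extracting a variable-depth "logical document hierarchy" from long documents, that is, organizing the recognized "physical document objects" into hierarchical structures. Discovering the logical document hierarchy is an important step in supporting many downstream applications. However, long documents with hundreds or even thousands of pages and a variable-depth hierarchy challenge existing methods. To address these challenges, we develop a framework called Hierarchy Extraction from Long Document (HELD), which "sequentially" inserts each physical object at the proper position in the current tree. Deciding whether each possible position is proper can be formulated as a binary classification problem. To further improve effectiveness and efficiency, we study design variants of HELD, including the traversal order of the insertion positions, explicit or implicit heading extraction, and tolerance to insertion errors in preceding steps. Empirical experiments on thousands of long documents from the Chinese and English financial markets and from English scientific publications show that, on the Chinese financial, English financial, and arXiv datasets, the "root-to-leaf" traversal order with explicit heading extraction is the best choice for balancing effectiveness and efficiency. Finally, we show that the logical document hierarchy can be used to significantly improve the performance of the downstream passage retrieval task. In summary, we study this task systematically in terms of methods, evaluation, and applications.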
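The sequential insertion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that candidate insertion positions are the nodes on the rightmost path of the current tree, that a "root-to-leaf" traversal stops at the first position judged proper, and that a toy rule on a hypothetical `level` feature stands in for the learned binary classifier. The names `Node`, `toy_is_proper`, `build_hierarchy`, and `outline` are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    text: str
    # Hypothetical feature: 0 = root, 1, 2, ... = heading depth, 99 = body text.
    level: int
    children: List["Node"] = field(default_factory=list)


def rightmost_path(root: Node) -> List[Node]:
    """Candidate parents, ordered from the root down to the last leaf."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def toy_is_proper(parent: Node, obj: Node) -> bool:
    """Stand-in for the binary classifier (assumption, not a trained model).

    Appending obj under parent is judged proper when parent is shallower
    than obj and parent's current last child is not shallower than obj.
    """
    if parent.level >= obj.level:
        return False
    last = parent.children[-1] if parent.children else None
    return last is None or last.level >= obj.level


def build_hierarchy(
    objects: List[Tuple[str, int]],
    is_proper: Callable[[Node, Node], bool] = toy_is_proper,
) -> Node:
    """Insert objects one by one, scanning candidate positions root-to-leaf."""
    root = Node("ROOT", 0)
    for text, level in objects:
        obj = Node(text, level)
        target = root  # fallback if every candidate position is rejected
        for candidate in rightmost_path(root):
            if is_proper(candidate, obj):
                target = candidate
                break  # assumed: stop at the first position judged proper
        target.children.append(obj)
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Flatten the tree into indented lines for inspection."""
    lines: List[str] = []
    for child in node.children:
        lines.append("  " * depth + child.text)
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    doc = [("1 Intro", 1), ("para a", 99), ("1.1 Background", 2),
           ("para b", 99), ("2 Method", 1), ("para c", 99)]
    print("\n".join(outline(build_hierarchy(doc))))
```

Passing a different `is_proper` callable is where a trained classifier would plug in. The number of classifier calls per object is bounded by the depth of the rightmost path, which is one plausible reason the traversal order matters for efficiency.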

Related papers

DISRetrieval: Harnessing Discourse Structure for Long Document Retrieval [51.89673002051528]
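DISRetrieval is a novel hierarchical retrieval framework that leverages linguistic discourse structure to enhance long document understanding. The study confirms that discourse structure significantly improves retrieval effectiveness across document lengths and query types.
Paper reference translation (metadata) (2025-05-26T14:45:12Z)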
UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis [7.057192434574117]
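We propose a unified relation prediction approach for HDSA, called UniHDSA. UniHDSA treats various HDSA sub-tasks as relation prediction problems and consolidates the relation prediction labels into a unified label space. To validate the effectiveness of UniHDSA, we develop a multimodal end-to-end system based on Transformer architectures.
Paper reference translation (metadata) (2025-03-20T06:44:47Z)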
ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval [64.44265315244579]
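We propose a tree-based method for organizing and representing reference documents at various levels. Our method, called ReTreever, jointly learns a routing function at each internal node of a binary tree so that a query and its reference documents are assigned to similar tree branches. In our evaluations, ReTreever generally preserves full representation accuracy.
Paper reference translation (metadata) (2025-02-11T21:35:13Z)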
HDT: Hierarchical Document Transformer [70.2271469410557]
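HDT exploits document structure by introducing auxiliary anchor tokens and redesigning the attention mechanism into a sparse multi-level hierarchy. We develop a novel sparse attention kernel that takes the hierarchical structure of documents into account.
Paper reference translation (metadata) (2024-07-11T09:28:04Z)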
Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis [9.340346869932434]
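We propose a tree construction based approach that handles multiple sub-tasks concurrently. We present an effective end-to-end solution based on this framework and demonstrate its performance. Our system achieves state-of-the-art performance on two large-scale document layout analysis datasets.
Paper reference translation (metadata) (2024-01-22T12:00:37Z)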
Unveiling Document Structures with YOLOv5 Layout Detection [0.0]
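This study investigates the use of YOLOv5, a state-of-the-art computer vision model, for rapidly identifying document layouts and extracting unstructured data. The main objective is to create an autonomous system that can effectively recognize document layouts and extract unstructured data.
Paper reference translation (metadata) (2023-09-29T07:45:10Z)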
PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
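Representing structured documents as plain text is at odds with the user's mental model of these documents as having rich structure. We propose PDFTriage, which enables models to retrieve context based on either structure or content. The benchmark dataset consists of more than 900 human-generated questions over more than 80 structured documents.
Paper reference translation (metadata) (2023-09-16T04:29:05Z)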
DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
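We propose this task, Document-Aware Passage Retrieval (DAPR). In analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that a major share of errors (53.5%) is due to missing document context. The proposed benchmark enables future development and comparison of retrieval systems.
Paper reference translation (metadata) (2023-05-23T10:39:57Z)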
CED: Catalog Extraction from Documents [12.037861186708799]
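We propose a transition-based framework for parsing documents into catalog trees. We believe the CED task could bridge the gap between raw text segments and information extraction tasks on extremely long documents.
Paper reference translation (metadata) (2023-04-28T07:32:00Z)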
Fine-Grained Distillation for Long Document Retrieval [86.39802110609062]
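Long document retrieval aims to fetch query-relevant documents from a large-scale collection. Knowledge distillation has become the de facto approach to improving a retriever by mimicking a heterogeneous yet powerful cross-encoder. We propose FGD, a new learning framework for long document retrieval.
Paper reference translation (metadata) (2022-12-20T17:00:36Z)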
Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
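Autoregressive language models are emerging as the de facto standard for generating answers. Previous work has explored ways to partition the search space into hierarchical structures. In this work, we propose an alternative that does not force any structure onto the search space: using all ngrams in a passage as its identifiers.
Paper reference translation (metadata) (2022-04-22T10:45:01Z)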
GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
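We propose GERE, the first system that retrieves evidence in a generative fashion. Experimental results on the FEVER dataset show that GERE achieves significant improvements over state-of-the-art baselines.
Paper reference translation (metadata) (2022-04-12T03:49:35Z)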

The related papers list is generated automatically from the titles and abstracts of papers on this site.

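This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).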