Fugu-MT Paper Translation (Abstract): MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG

Paper overview: MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG

arxiv url: http://arxiv.org/abs/2603.23533v2
Date: Fri, 27 Mar 2026 05:05:15 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:13.023543
Title: MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG
Authors: Bhavik Mangla
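Abstract summary: RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. MDKeyChunker is a three-stage pipeline for Markdown documents that performs structure-aware chunking, treating headers, code blocks, tables, and lists as atomic units. Its single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching.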
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.
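The restructuring stage described in the abstract (merging chunks that share a semantic key via bin-packing) might look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's implementation: the `Chunk` class, the `semantic_key` field name, the word-count size estimate, the `max_size` budget, and the choice of a first-fit heuristic are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    semantic_key: str  # hypothetical name for the key assigned during enrichment


def size(text: str) -> int:
    """Crude size estimate: whitespace-separated word count (an assumption)."""
    return len(text.split())


def merge_by_key(chunks: list[Chunk], max_size: int = 50) -> list[Chunk]:
    """Group chunks by semantic key, then first-fit pack each group into bins.

    A chunk joins the first bin of its key that still has room; otherwise it
    opens a new bin. A chunk larger than max_size ends up alone in its own bin.
    """
    groups: dict[str, list[Chunk]] = defaultdict(list)
    for chunk in chunks:
        groups[chunk.semantic_key].append(chunk)

    merged: list[Chunk] = []
    for key, members in groups.items():
        bins: list[list[Chunk]] = []
        loads: list[int] = []
        for chunk in members:  # document order is preserved within each bin
            n = size(chunk.text)
            for i, load in enumerate(loads):
                if load + n <= max_size:
                    bins[i].append(chunk)
                    loads[i] += n
                    break
            else:
                bins.append([chunk])
                loads.append(n)
        for members_in_bin in bins:
            merged.append(Chunk("\n\n".join(c.text for c in members_in_bin), key))
    return merged
```

First-fit in document order is used here because it keeps related passages in their original sequence; a first-fit-decreasing variant would pack more tightly at the cost of reordering text.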
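For readers unfamiliar with the reported metrics, Recall@5 and MRR can be computed along the following lines. This is a minimal sketch that assumes exactly one relevant document per query, so that Recall@k reduces to the fraction of queries whose relevant document appears in the top k; the abstract does not state which definition the paper uses, and the function names and sample data are hypothetical.

```python
def recall_at_k(rankings: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of queries whose single relevant id appears in the top k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(rankings: list[list[str]], gold: list[str]) -> float:
    """Average of 1/rank of the relevant id; a query that misses contributes 0."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Toy data (made up for illustration): three queries, one relevant id each.
rankings = [["a", "b"], ["c", "d", "e"], ["x", "y"]]
gold = ["a", "d", "z"]
print(recall_at_k(rankings, gold, k=5))      # two of three queries hit
print(mean_reciprocal_rank(rankings, gold))  # (1 + 1/2 + 0) / 3 = 0.5
```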

Related papers