Fugu-MT Paper Translation (Abstract): KohakuRAG: A simple RAG framework with hierarchical document indexing

Paper abstract: KohakuRAG: A simple RAG framework with hierarchical document indexing

arxiv url: http://arxiv.org/abs/2603.07612v1
Date: Sun, 08 Mar 2026 12:52:39 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:14.909674
Title: KohakuRAG: A simple RAG framework with hierarchical document indexing
Title (reference translation): KohakuRAG: A simple RAG framework with hierarchical document indexing
Authors: Shih-Ying Yeh, Yueh-Feng Ku, Ko-Wei Huang, Buu-Khang Tu
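Abstract summary: We propose KohakuRAG, a hierarchical RAG framework that preserves document structure through a four-level tree representation. We evaluate it on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents.
Reference score (independently computed attention measure): 1.0844295385744671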
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Retrieval-augmented generation (RAG) systems that answer questions from document collections face compounding difficulties when high-precision citations are required: flat chunking strategies sacrifice document structure, single-query formulations miss relevant passages through vocabulary mismatch, and single-pass inference produces stochastic answers that vary in both content and citation selection. We present KohakuRAG, a hierarchical RAG framework that preserves document structure through a four-level tree representation (document $\rightarrow$ section $\rightarrow$ paragraph $\rightarrow$ sentence) with bottom-up embedding aggregation, improves retrieval coverage through an LLM-powered query planner with cross-query reranking, and stabilizes answers through ensemble inference with abstention-aware voting. We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with $\pm$0.1% numeric tolerance and exact source attribution. KohakuRAG achieves first place on both public and private leaderboards (final score 0.861), as the only team to maintain the top position across both evaluation partitions. Ablation studies reveal that prompt ordering (+80% relative), retry mechanisms (+69%), and ensemble voting with blank filtering (+1.2pp) each contribute substantially, while hierarchical dense retrieval alone matches hybrid sparse-dense approaches (BM25 adds only +3.1pp). We release KohakuRAG as open-source software at https://github.com/KohakuBlueleaf/KohakuRAG.
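The abstract names a four-level tree (document, section, paragraph, sentence) with bottom-up embedding aggregation but does not say how child embeddings are combined. The following is a minimal sketch, assuming that a parent's embedding is the unweighted mean of its children's embeddings. The `Node` class, the `aggregate` function, and the toy 2-dimensional vectors are hypothetical illustrations and are not taken from the KohakuRAG code base.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node in a document -> section -> paragraph -> sentence tree (hypothetical type)."""
    level: str
    text: str = ""
    embedding: Optional[List[float]] = None
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> List[float]:
    """Fill in embeddings bottom-up.

    Assumption: a parent's vector is the plain mean of its children's vectors.
    The paper may use a different (for example, weighted) scheme.
    """
    if not node.children:
        if node.embedding is None:
            raise ValueError(f"leaf node at level {node.level!r} has no embedding")
        return node.embedding
    child_vecs = [aggregate(child) for child in node.children]
    dim = len(child_vecs[0])
    node.embedding = [
        sum(vec[i] for vec in child_vecs) / len(child_vecs) for i in range(dim)
    ]
    return node.embedding


# Toy tree: sentence embeddings are made-up 2-d vectors standing in for a real encoder.
s1 = Node("sentence", "first sentence", [1.0, 0.0])
s2 = Node("sentence", "second sentence", [0.0, 1.0])
s3 = Node("sentence", "third sentence", [1.0, 1.0])
p1 = Node("paragraph", children=[s1, s2])
p2 = Node("paragraph", children=[s3])
sec = Node("section", children=[p1, p2])
doc = Node("document", children=[sec])
aggregate(doc)
```

With this layout only sentences need to be passed through an embedding model; every higher level gets a vector derived from its descendants, so retrieval can match at any granularity of the tree.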
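Abstract (reference translation): Flat chunking strategies sacrifice document structure, single-query formulations miss relevant passages through vocabulary mismatch, and single-pass inference produces stochastic answers that vary in both content and citation selection. We describe KohakuRAG, a hierarchical RAG framework that preserves document structure through a four-level tree representation (document $\rightarrow$ section $\rightarrow$ paragraph $\rightarrow$ sentence) with bottom-up embedding aggregation, improves retrieval coverage through an LLM-powered query planner with cross-query reranking, and stabilizes answers through ensemble inference with abstention-aware voting. We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with $\pm$0.1% numeric tolerance and exact source attribution. KohakuRAG took first place on both the public and private leaderboards (final score 0.861) and was the only team to hold the top position across both evaluation partitions. Ablation studies show that prompt ordering (+80%), retry mechanisms (+69%), and ensemble voting with blank filtering (+1.2pp) each contribute substantially, while hierarchical dense retrieval alone matches hybrid sparse-dense approaches (BM25 adds only +3.1pp). We release KohakuRAG as open-source software at https://github.com/KohakuBlueleaf/KohakuRAG.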
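The abstract mentions ensemble inference with abstention-aware voting and blank filtering without giving the voting rule. One plausible reading, sketched here under stated assumptions, is a majority vote that discards blank (abstained) answers before counting and returns a blank only when every run abstained. The `vote` function and the use of an empty string as the abstention marker are hypothetical and are not taken from the KohakuRAG code base.

```python
from collections import Counter
from typing import List

BLANK = ""  # assumed marker for an abstained answer


def vote(answers: List[str]) -> str:
    """Majority vote over ensemble answers with blank filtering.

    Assumption: blank answers are dropped before counting, so a minority of
    confident runs can outvote a majority of abstentions. Ties go to the
    answer that appeared first among the non-blank runs.
    """
    non_blank = [a.strip() for a in answers if a.strip() != BLANK]
    if not non_blank:
        return BLANK  # every run abstained
    return Counter(non_blank).most_common(1)[0][0]


# Five hypothetical runs for one question: two abstain, two agree, one differs.
result = vote(["", "", "42", "42", "41"])
```

Without the filtering step, the blank answer would win the vote in the example above because it is tied for the highest count and appears first; filtering is what lets the two agreeing runs decide the outcome.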

Related papers