Fugu-MT Paper Translation (Abstract): RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora

Paper summary: RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora

arxiv url: http://arxiv.org/abs/2604.19047v1
Date: Tue, 21 Apr 2026 03:54:09 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.605715
Title: RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora
Title (reference translation): RARE: A Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora
Authors: Hanjun Cho, Jay-Yoon Lee
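Abstract summary: This paper proposes a framework for constructing realistic benchmarks by decomposing documents into atomic facts. On RedQA, a strong retriever baseline drops from 66.4% PerfRecall@10 on 4-hop General-Wiki to 5.0-27.9% PerfRecall@10 at 4-hop depth.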
Reference score (independently computed attention measure): 11.316299961548415
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing QA benchmarks typically assume distinct documents with minimal overlap, yet real-world retrieval-augmented generation (RAG) systems operate on corpora such as financial reports, legal codes, and patents, where information is highly redundant and documents exhibit strong inter-document similarity. This mismatch undermines evaluation validity: retrievers can be unfairly undervalued even when they retrieve documents that provide sufficient evidence, because redundancy across documents is not accounted for in evaluation. On the other hand, retrievers that perform well on standard benchmarks often generalize poorly to real-world corpora with highly similar and redundant documents. We present RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing realistic benchmarks by (i) decomposing documents into atomic facts to enable precise redundancy tracking and (ii) enhancing LLM-based data generation with CRRF. RAG benchmark data usually requires multiple quality criteria, but LLMs often yield trivial outputs. CRRF scores criteria separately and fuses decisions by rank, improving the reliability of generated data. Applying RARE to Finance, Legal, and Patent corpora, we introduce RedQA, where a strong retriever baseline drops from 66.4% PerfRecall@10 on 4-hop General-Wiki to 5.0-27.9% PerfRecall@10 at 4-hop depth, revealing robustness gaps that current benchmarks fail to capture. RARE enables practitioners to build domain-specific RAG evaluations that faithfully reflect real-world deployment conditions.
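
The abstract says only that CRRF "scores criteria separately and fuses decisions by rank"; it does not give the formula. As a rough illustration of that general idea, the sketch below uses standard reciprocal rank fusion (RRF) as a stand-in, which may differ from the paper's actual method. The criterion names, candidate ids, scores, and the constant `k=60` are all hypothetical.

```python
from collections import defaultdict


def rank_by_score(scores):
    """Return candidate ids ordered from best to worst for one criterion."""
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


def fuse_by_rank(per_criterion_scores, k=60):
    """Fuse per-criterion rankings with reciprocal rank fusion (RRF).

    per_criterion_scores: dict mapping criterion -> {candidate_id: score}.
    k: RRF smoothing constant (60 is a common default; an assumption here).
    """
    fused = defaultdict(float)
    for scores in per_criterion_scores.values():
        # Each criterion is ranked on its own, so raw score scales never mix.
        for rank, cid in enumerate(rank_by_score(scores), start=1):
            fused[cid] += 1.0 / (k + rank)
    return sorted(fused, key=lambda cid: fused[cid], reverse=True)


# Hypothetical criteria and scores for three generated questions.
scores = {
    "answerability": {"q1": 0.9, "q2": 0.8, "q3": 0.2},
    "multi_hop": {"q1": 0.3, "q2": 0.9, "q3": 0.8},
    "fluency": {"q1": 0.6, "q2": 0.7, "q3": 0.9},
}
ranking = fuse_by_rank(scores)
print(ranking)
```

In this toy input no candidate is best on every criterion; fusing ranks rather than raw scores favors `q2`, which is never ranked last on any single criterion.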
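Abstract (reference translation): Existing QA benchmarks typically assume distinct documents with minimal overlap, whereas real-world retrieval-augmented generation (RAG) systems operate on corpora such as financial reports, legal codes, and patents, in which information is highly redundant and documents are strongly similar to one another. This mismatch undermines evaluation validity: because redundancy across documents is not accounted for in evaluation, a retriever can be unfairly undervalued even when it retrieves documents that provide sufficient evidence. Conversely, retrievers that perform well on standard benchmarks often generalize poorly to real-world corpora with highly similar, redundant documents. The paper proposes RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing realistic benchmarks by (i) decomposing documents into atomic facts to enable precise redundancy tracking and (ii) enhancing LLM-based data generation with CRRF. RAG benchmark data usually must satisfy multiple quality criteria, but LLMs often yield trivial outputs. CRRF scores the criteria separately and fuses the decisions by rank, improving the reliability of the generated data. Applying RARE to Finance, Legal, and Patent corpora yields RedQA, on which a strong retriever baseline drops from 66.4% PerfRecall@10 on 4-hop General-Wiki to 5.0-27.9% PerfRecall@10 at 4-hop depth, revealing robustness gaps that current benchmarks fail to capture. RARE enables practitioners to build domain-specific RAG evaluations that faithfully reflect real-world deployment conditions.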
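
To make the redundancy problem concrete, here is a minimal sketch of a fact-level recall check. It assumes each document has already been decomposed into atomic-fact ids, and it interprets PerfRecall@k as the fraction of queries whose top-k documents cover every required fact. That interpretation, the function names, and the toy data are assumptions for illustration; the abstract does not define the metric.

```python
def covered_facts(retrieved_docs, doc_facts):
    """Union of atomic-fact ids contained in the retrieved documents."""
    facts = set()
    for doc_id in retrieved_docs:
        facts |= doc_facts.get(doc_id, set())
    return facts


def perfect_recall_at_k(runs, required_facts, doc_facts, k=10):
    """Fraction of queries whose top-k documents cover every required fact.

    runs: dict mapping query id -> ranked list of document ids.
    required_facts: dict mapping query id -> set of fact ids needed to answer.
    doc_facts: dict mapping document id -> set of fact ids it states.
    """
    hits = 0
    for qid, ranked_docs in runs.items():
        # Subset test: any document stating a fact counts, not one gold doc.
        if required_facts[qid] <= covered_facts(ranked_docs[:k], doc_facts):
            hits += 1
    return hits / len(runs) if runs else 0.0


# Hypothetical toy corpus: d1 and d2 are redundant (both state fact f1).
doc_facts = {"d1": {"f1"}, "d2": {"f1"}, "d3": {"f2"}, "d4": {"f3"}}
required_facts = {"qa": {"f1", "f2"}, "qb": {"f1", "f3"}}
runs = {"qa": ["d2", "d3"], "qb": ["d1", "d3"]}
score = perfect_recall_at_k(runs, required_facts, doc_facts, k=2)
print(score)
```

Query `qa` counts as a success even though the retriever returned `d2` rather than `d1`, because both state `f1`. An evaluation keyed to a single gold document id would have marked that retrieval as a miss, which is the undervaluation the abstract describes.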

Related papers