Fugu-MT Paper Translation (Abstract): LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

Paper overview: LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

arxiv url: http://arxiv.org/abs/2603.06198v1
Date: Fri, 06 Mar 2026 12:10:50 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.671057
Title: LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Title (reference translation): LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Authors: Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki
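Abstract summary: Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. Existing benchmarks for Generators provide limited coverage, and none enables simultaneous evaluation of multiple capabilities under unified conditions. We introduce LIT-RAGBench, which defines five categories (Integration, Reasoning, Logic, Table, and Abstention), each further divided into practical evaluation aspects.
Reference score (independently computed attention metric): 1.1417805445492082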
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT-RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT-RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG-specialized models. We release LIT-RAGBench, including the dataset and evaluation code, at https://github.com/Koki-Itai/LIT-RAGBench.
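Abstract (reference translation): Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), generates answers by using a Retriever to fetch documents from an external collection. In practice, a Generator has to integrate evidence from long contexts, carry out multi-step reasoning, interpret tables, and abstain when evidence is missing. Existing benchmarks for Generators, however, offer limited coverage, and none of them allows multiple capabilities to be evaluated simultaneously under unified conditions. To close the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark). LIT-RAGBench systematically covers patterns that combine multiple aspects across categories. By using fictional entities and scenarios, it evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version produced by machine translation with human curation. Scoring uses LLM-as-a-Judge, and category-wise and overall accuracy are reported. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a useful metric for model selection in practical RAG deployments and for building RAG-specialized models. LIT-RAGBench, including the dataset and evaluation code, is released at https://github.com/Koki-Itai/LIT-RAGBench.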

Related papers