Fugu-MT Paper Translation (Summary): RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

Paper summary: RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

arxiv url: http://arxiv.org/abs/2511.04502v1
Date: Thu, 06 Nov 2025 16:22:52 GMT
Status: Translation complete
Last updated in system: 2025-11-07 20:17:53.49849
Title: RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG
Title (reference translation): RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG
Authors: Joshua Gao, Quoc Huy Pham, Subin Varghese, Silwal Saurav, Vedhus Hoskere
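Abstract summary: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence. Existing evaluation frameworks often rely on metrics that fail to capture domain-specific nuances. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems.
Reference score (independently calculated attention score): 0.0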
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, while other works use LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics, Answer Correctness and Answerability, using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent. No single embedding model, LLM, or hyperparameter configuration proves universally optimal. Additionally, we provide an analysis of the most common reasons for low Answer Correctness in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on our GitHub.
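Abstract (reference translation): Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence. Existing evaluation frameworks often rely on heuristic metrics that fail to capture domain-specific nuances. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality synthetic question-answering (QA) datasets from source documents and incorporates an agentic filtering step to ensure data fidelity. The framework uses prompt optimization to refine two key LLM-as-a-Judge metrics, Answer Correctness and Answerability, so that they correlate strongly with human annotations. Applying the framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering) shows that performance is highly context-dependent. No single embedding model, LLM, or hyperparameter configuration is universally optimal. The paper also analyzes the most common reasons for low Answer Correctness in RAG. RAGalyst enables practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on GitHub.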
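The abstract states that the refined LLM-as-a-Judge metrics reach a strong correlation with human annotations. One common way to quantify such agreement is Spearman rank correlation. The following is a minimal sketch under stated assumptions, not RAGalyst's own code: the function names and the sample scores are hypothetical, and the abstract does not say which correlation statistic the authors use.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(human: Sequence[float], judge: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(human) != len(judge) or len(human) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(human), average_ranks(judge))


if __name__ == "__main__":
    # Hypothetical data: human ratings (1-5) and judge scores (0-1)
    # for the same five answers.
    human_scores = [1, 2, 3, 4, 5]
    judge_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
    print(f"Spearman rho: {spearman(human_scores, judge_scores):.3f}")
```

A value near 1 would indicate that the judge orders answers much as the human annotators do. In practice, a library routine such as `scipy.stats.spearmanr` would likely replace the hand-written helpers.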


Related papers