Fugu-MT Paper Translation (Summary): DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

Paper summary: DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

arxiv url: http://arxiv.org/abs/2509.04499v1
Date: Tue, 02 Sep 2025 00:32:38 GMT
Status: Translation complete
Last updated in system: 2025-09-08 14:27:25.335695
Title: DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence
Title (reference translation): DeepTRACE: An Examination of Deep Research AI Systems for Tracking Reliability Across Citations and Evidence
Authors: Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Kung-Hsiang Huang, Yixin Mao, Chien-Sheng Wu
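Abstract summary: Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. DeepTRACE is a new sociotechnically grounded audit framework that turns community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations.
Reference score (independently calculated attention level): 50.97612134791782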
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40% to 80% across systems.
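Abstract (reference translation): Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. DeepTRACE is a new sociotechnically grounded audit framework that turns community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit, end to end, how systems reason with evidence and attribute it. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM judge whose agreement with human raters has been validated, both web-search engines and deep-research configurations are evaluated. The results indicate that generative search engines and deep research agents frequently produce one-sided, highly confident responses to debate queries and include many statements that are not supported by their own listed sources. Deep-research configurations reduce overconfidence and can achieve high citation thoroughness, but they remain highly one-sided on debate queries and still show a substantial share of unsupported statements, with citation accuracy ranging from 40% to 80% across systems.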
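
The abstract mentions citation and factual-support matrices built over decomposed statements. As a rough illustration only, and not the authors' implementation, the sketch below assumes two binary matrices with one row per statement and one column per listed source: `citation[i][j]` is 1 when statement i cites source j, and `support[i][j]` is 1 when source j factually supports statement i. The function names and both metric definitions are assumptions made for this example; the abstract does not define them.

```python
from typing import List

Matrix = List[List[int]]  # rows = statements, columns = listed sources


def citation_accuracy(citation: Matrix, support: Matrix) -> float:
    """Assumed definition: share of (statement, source) citations
    where the cited source also supports the statement."""
    cited = 0
    correct = 0
    for c_row, s_row in zip(citation, support):
        for c, s in zip(c_row, s_row):
            if c:
                cited += 1
                if s:
                    correct += 1
    return correct / cited if cited else 0.0


def unsupported_fraction(support: Matrix) -> float:
    """Assumed definition: share of statements that no listed source supports."""
    if not support:
        return 0.0
    unsupported = sum(1 for row in support if not any(row))
    return unsupported / len(support)


# Hypothetical toy data: 3 statements, 3 sources.
citation = [
    [1, 0, 0],  # statement 0 cites source 0
    [0, 1, 1],  # statement 1 cites sources 1 and 2
    [0, 0, 0],  # statement 2 cites nothing
]
support = [
    [1, 0, 0],  # source 0 supports statement 0
    [0, 0, 1],  # only source 2 supports statement 1
    [0, 0, 0],  # no source supports statement 2
]

print(citation_accuracy(citation, support))   # 2 of 3 citations are supported
print(unsupported_fraction(support))          # 1 of 3 statements is unsupported
```

Keeping the two matrices separate lets citation behavior (what the system claims as evidence) be compared against factual support (what the sources actually back), which is the kind of end-to-end attribution check the abstract describes.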
