Fugu-MT Paper Translation (Abstract): TaskEval: Synthesised Evaluation for Foundation-Model Tasks

Paper overview: TaskEval: Synthesised Evaluation for Foundation-Model Tasks

arxiv url: http://arxiv.org/abs/2512.04442v1
Date: Thu, 04 Dec 2025 04:19:24 GMT
Status: Translation complete
Last updated in system: 2025-12-05 21:11:45.983479
Title: TaskEval: Synthesised Evaluation for Foundation-Model Tasks
Title (reference translation): TaskEval: Synthesised Evaluation of Foundation-Model Tasks
Authors: Dilani Widanapathiranage, Scott Barnett, Stefanus Kurniawan, Wannita Takerngsaksiri
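Abstract summary: This paper proposes an approach that synthesises an FM task-specific evaluator program, providing automation and a custom UI for capturing feedback. The core features of the approach are: (1) a task-agnostic meta-model that captures the properties of an FM task, (2) an interaction protocol for efficient use of human feedback, and (3) an eval synthesiser that selects or generates an appropriate set of evals.
Reference score (independently computed attention measure): 1.0219621548854343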
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Hallucinations are a key concern when creating applications that rely on foundation models (FMs). Understanding where and how these subtle failures occur in an application relies on evaluation methods known as evals. Prior work focuses on defining new eval methods or benchmark datasets for specific tasks. However, neither helps a software team with a task-specific FM application when there is no metric or dataset. The demand for both automated approaches and deep integration of human insight makes this a challenging problem. We address this gap by proposing an approach to synthesise an FM task-specific evaluator program that provides automation and a custom UI for capturing feedback. The core novelty of our approach lies in: (1) a task-agnostic meta-model that captures properties of any FM task, (2) an interaction protocol for efficient use of human feedback, and (3) an eval synthesiser that selects or generates an appropriate set of evals. We implement our approach in TaskEval and demonstrate the concept on two diverse FM tasks: chart data extraction and document question answering. A preliminary evaluation of the quality of our selected evals shows 93% and 90% accuracy, respectively. Our research tackles a growing problem facing engineering teams: how to evaluate and review outputs from FM tasks.
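Abstract (reference translation): Hallucinations are a key concern when building applications that rely on foundation models (FMs). Understanding where and how these subtle failures occur within an application depends on evaluation methods called evals. Prior work has focused on defining new eval methods or benchmark datasets for specific tasks. However, neither helps a software team with a task-specific FM application when no metric or dataset exists. The demand for both automated approaches and deep integration of human insight makes this a difficult problem. We address this gap by proposing an approach that synthesises an FM task-specific evaluator program, providing automation and a custom UI for capturing feedback. The core features of the approach are: (1) a task-agnostic meta-model that captures the properties of an FM task, (2) an interaction protocol for efficient use of human feedback, and (3) an eval synthesiser that selects or generates appropriate evals. We implement the approach in TaskEval and demonstrate the concept on two different FM tasks: chart data extraction and document question answering. A preliminary evaluation of the quality of the selected evals shows 93% and 90% accuracy, respectively. Our research tackles a problem facing engineering teams: how to evaluate and review outputs from FM tasks.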
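
The abstract names a task-agnostic meta-model and an eval synthesiser that selects evals for a task. The sketch below is only an illustration of that general idea under assumptions of our own: the `TaskDescription` fields, the three example evals, the `REGISTRY` mapping, and `select_evals` are all hypothetical names and are not taken from the paper or its tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical meta-model: these fields are illustrative assumptions,
# not the schema described in the paper.
@dataclass(frozen=True)
class TaskDescription:
    name: str
    output_kind: str      # e.g. "table" or "free_text"
    has_reference: bool   # whether ground-truth outputs are available


# An eval maps (model_output, reference) to a score in [0, 1].
Eval = Callable[[str, str], float]


def exact_match(output: str, reference: str) -> float:
    """Score 1.0 only when output and reference match after trimming."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def token_overlap(output: str, reference: str) -> float:
    """Fraction of distinct reference tokens that also appear in the output."""
    out = set(output.lower().split())
    ref = set(reference.lower().split())
    return len(out & ref) / len(ref) if ref else 0.0


def non_empty(output: str, reference: str) -> float:
    """Reference-free check: the model produced some output."""
    return 1.0 if output.strip() else 0.0


# Hypothetical registry of reference-based evals, keyed by output kind.
REGISTRY: Dict[str, List[Eval]] = {
    "table": [exact_match],
    "free_text": [token_overlap],
}


def select_evals(task: TaskDescription) -> List[Eval]:
    """Pick evals from the task description (selection only, no generation)."""
    evals: List[Eval] = [non_empty]
    if task.has_reference:
        evals += REGISTRY.get(task.output_kind, [])
    return evals


doc_qa = TaskDescription("document_qa", "free_text", has_reference=True)
selected = [e.__name__ for e in select_evals(doc_qa)]
```

This covers only the "select" half of "selects or generates"; generating new evals and the human-feedback interaction protocol are outside the scope of the sketch.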

Related papers