Fugu-MT Paper Translation (Summary): Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research

Paper summary: Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research

arxiv url: http://arxiv.org/abs/2605.29249v1
Date: Thu, 28 May 2026 02:09:38 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.590112
Title: Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research
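Title (reference translation): Prediction-Powered Inference for AI Evaluation and Social Science Research
Authors: Nicolas Emmenegger, Ellery Stahler, Chara Podimata
Abstract summary: Many applications require statistically valid inference across many related tasks, yet only a handful of high-quality labels are available per hypothesis. This paper proposes a prediction-powered inference framework that uses labeled data from related tasks to improve power while preserving task-specific inference. The study shows that confidence interval widths can be substantially reduced when labels are scarce.
Reference score (independently computed attention measure): 6.716363754264257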
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many applications require statistically valid inference across many related tasks, while using only a handful of high-quality labels per hypothesis. In AI evaluation, these tasks may correspond to model behaviors across prompts, subgroups, or hypotheses; in social science surveys, they may correspond to related questions, populations, or measurement conditions. Prediction-powered inference (PPI) uses abundant but inexpensive proxy measurements to improve inference from limited, ground-truth labels, but commonly used methods treat tasks independently and therefore fail to exploit shared structure across related tasks. This limitation is especially important in settings where only a small number of labels are available per task. To address this issue, we introduce a multi-task prediction-powered inference framework that uses labeled data from related tasks to improve power while preserving task-specific inference. Our methods exploit the shared structure in the proxy-ground-truth relationship through cross-task recalibration, while retaining within-task rectification and power tuning to construct accurate point estimates and confidence intervals. We prove that efficiency gains beyond power-tuned PPI are only possible when the proxy-ground-truth relationship contains nonlinear structure; affine cross-task recalibrations are asymptotically equivalent to using the original proxy. We complement our theoretical findings with experiments on synthetic and semi-synthetic datasets, as well as a case study auditing language models on election-related information during the 2024 U.S. presidential election. Using a large human-annotation study, we show that cross-task recalibration can substantially reduce confidence interval widths when labels are scarce.
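The within-task rectification and power tuning mentioned in the abstract can be illustrated with a minimal single-task sketch. This is not the authors' multi-task method: it is the standard power-tuned PPI estimator of a mean, and the function name `ppi_mean`, the synthetic data, and the choice to estimate the proxy variance from pooled proxies are all assumptions made for illustration. The final lines apply a fixed affine map to the proxy to give some intuition for why an affine recalibration cannot help: the power-tuning weight rescales to compensate, so the estimate and interval are unchanged. The paper's result concerns recalibrations learned across tasks and is asymptotic; the check here covers only the simpler fixed-map case.

```python
import numpy as np
from statistics import NormalDist


def ppi_mean(y_lab, f_lab, f_unlab, alpha=0.05):
    """Power-tuned PPI estimate of E[Y] with a normal-approximation CI.

    y_lab   : ground-truth labels on the small labeled sample (size n)
    f_lab   : proxy measurements on the same labeled sample (size n)
    f_unlab : proxy measurements on the large unlabeled sample (size N)
    """
    y_lab, f_lab, f_unlab = (
        np.asarray(a, dtype=float) for a in (y_lab, f_lab, f_unlab)
    )
    n, N = len(y_lab), len(f_unlab)

    # Power-tuning weight: label/proxy covariance over proxy variance,
    # shrunk by (1 + n/N). Pooling proxies for the variance is an assumption.
    cov_yf = np.cov(y_lab, f_lab, ddof=1)[0, 1]
    var_f = np.var(np.concatenate([f_lab, f_unlab]), ddof=1)
    lam = cov_yf / ((1.0 + n / N) * var_f)

    # Rectified estimate: labeled mean plus a weighted proxy correction.
    theta = y_lab.mean() + lam * (f_unlab.mean() - f_lab.mean())

    # Plug-in standard error: labeled residual term plus unlabeled proxy term.
    se = np.sqrt(
        np.var(y_lab - lam * f_lab, ddof=1) / n
        + lam**2 * np.var(f_unlab, ddof=1) / N
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta, (theta - z * se, theta + z * se)


# Synthetic example (all values hypothetical): few labels, many proxies.
rng = np.random.default_rng(0)
n, N = 50, 5000
f_lab = rng.normal(size=n)
y_lab = 0.8 * f_lab + rng.normal(scale=0.5, size=n)
f_unlab = rng.normal(size=N)

theta, ci = ppi_mean(y_lab, f_lab, f_unlab)

# Same data, but with the proxy passed through a fixed affine map 3f - 2.
theta_affine, ci_affine = ppi_mean(y_lab, 3.0 * f_lab - 2.0, 3.0 * f_unlab - 2.0)

print("PPI estimate:", theta, "CI:", ci)
print("Affine-proxy estimate:", theta_affine, "CI:", ci_affine)
```

Under these assumptions the two calls agree up to floating-point error, which is consistent with the abstract's statement that gains beyond power-tuned PPI require nonlinear structure in the proxy-ground-truth relationship.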
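Abstract (reference translation): Many applications require statistically valid inference across many related tasks, yet only a handful of high-quality labels are available per hypothesis. In AI evaluation, these tasks may correspond to model behaviors across prompts, subgroups, or hypotheses; in social science surveys, they may correspond to related questions, populations, or measurement conditions. Prediction-powered inference (PPI) uses abundant but inexpensive proxy measurements to improve inference from limited ground-truth labels, but commonly used methods treat tasks independently and therefore cannot exploit the structure shared across related tasks. This limitation matters most in settings where only a small number of labels are available per task. To address it, the authors propose a multi-task prediction-powered inference framework that uses labeled data from related tasks to improve power while preserving task-specific inference. The proposed methods exploit the shared structure of the proxy-ground-truth relationship through cross-task recalibration, while retaining within-task rectification and power tuning to construct accurate point estimates and confidence intervals. The authors prove that efficiency gains beyond power-tuned PPI are possible only when the proxy-ground-truth relationship contains nonlinear structure; affine cross-task recalibrations are asymptotically equivalent to using the original proxy. The theory is complemented by experiments on synthetic and semi-synthetic datasets and by a case study auditing language models on election-related information during the 2024 U.S. presidential election. Using a large human-annotation study, the authors show that cross-task recalibration can substantially reduce confidence interval widths when labels are scarce.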


Related papers