Fugu-MT Paper Translation (Abstract): Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

Paper overview: Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

arxiv url: http://arxiv.org/abs/2603.03319v1
Date: Mon, 09 Feb 2026 20:55:16 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:08.147548
Title: Automated Concept Discovery for LLM-as-a-Judge Preference Analysis
Title (reference translation): Automated Concept Discovery for LLM-as-a-Judge Preference Analysis
Authors: James Wedgwood, Chhavi Yadav, Virginia Smith
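Abstract summary: Large language models (LLMs) are increasingly used as scalable evaluators of model outputs. Their preference judgments exhibit systematic biases and can diverge from human evaluations. The paper studies embedding-level concept extraction methods for analyzing LLM judge behavior.
Reference score (independently computed attention score): 21.171990974350773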
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them to those of human annotators. Our method both validates existing results, such as the tendency for LLMs to prefer refusal of sensitive requests at higher rates than humans, and uncovers new trends across both general and domain-specific datasets, including biases toward responses that emphasize concreteness and empathy in approaching new situations, toward detail and formality in academic advice, and against legal guidance that promotes active steps like calling police and filing lawsuits. Our results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies.
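Abstract (reference translation): Large language models (LLMs) are increasingly used as a scalable means of evaluating model outputs, but their preference judgments show systematic biases and can deviate from human evaluations. Previous work on LLM-as-a-judge has focused mainly on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. This work addresses that gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. The methods are compared in terms of interpretability and predictiveness; sparse autoencoder-based approaches recover substantially more interpretable preference features than the alternatives while remaining competitive at predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, the authors analyze LLM judgments and compare them with those of human annotators. The method both validates existing results, such as the tendency of LLMs to prefer refusal of sensitive requests at higher rates than humans, and uncovers new trends across general and domain-specific datasets: biases toward responses that emphasize concreteness and empathy in approaching new situations, toward detail and formality in academic advice, and against legal guidance that promotes active steps such as calling the police and filing lawsuits. The results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies.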

Related papers