Fugu-MT Paper Translation (Abstract): IDEAlign: Comparing Large Language Models to Human Experts in Open-ended Interpretive Annotations

Paper overview: IDEAlign: Comparing Large Language Models to Human Experts in Open-ended Interpretive Annotations

arxiv url: http://arxiv.org/abs/2509.02855v1
Date: Tue, 02 Sep 2025 21:58:58 GMT
Status: Translation complete
Last updated in system: 2025-09-04 21:40:46.350585
Title: IDEAlign: Comparing Large Language Models to Human Experts in Open-ended Interpretive Annotations
Authors: Hyunji Nam, Lucia Langlois, James Malamut, Mei Tan, Dorottya Demszky
Abstract summary: Large language models (LLMs) are increasingly applied to open-ended, interpretive annotation tasks. Currently, no validated, scalable measure of similarity in ideas exists.
Reference score (independently computed attention score): 5.5560439396390455
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) are increasingly applied to open-ended, interpretive annotation tasks, such as thematic analysis by researchers or generating feedback on student work by teachers. These tasks involve free-text annotations requiring expert-level judgments grounded in specific objectives (e.g., research questions or instructional goals). Evaluating whether LLM-generated annotations align with those generated by expert humans is challenging to do at scale, and currently, no validated, scalable measure of similarity in ideas exists. In this paper, we (i) introduce the scalable evaluation of interpretive annotation by LLMs as a critical and understudied task, (ii) propose IDEAlign, an intuitive benchmarking paradigm for capturing expert similarity ratings via a "pick-the-odd-one-out" triplet judgment task, and (iii) evaluate various similarity metrics, including vector-based ones (topic models, embeddings) and LLM-as-a-judge via IDEAlign, against these human benchmarks. Applying this approach to two real-world educational datasets (interpretive analysis and feedback generation), we find that vector-based metrics largely fail to capture the nuanced dimensions of similarity meaningful to experts. Prompting LLMs via IDEAlign significantly improves alignment with expert judgments (9-30% increase) compared to traditional lexical and vector-based metrics. These results establish IDEAlign as a promising paradigm for evaluating LLMs against open-ended expert annotations at scale, informing responsible deployment of LLMs in education and beyond.
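The abstract does not spell out how a similarity metric is scored against the expert triplet judgments, so the following is only a minimal sketch under stated assumptions. It assumes that a metric "picks" the odd one out by finding the most similar pair in a triplet and returning the remaining item, and that alignment is the fraction of triplets where this pick matches the expert's. The names `odd_one_out`, `agreement`, and `jaccard` are hypothetical, and word-level Jaccard overlap is a stand-in lexical metric, not one the paper is known to use.

```python
from itertools import combinations
from typing import Callable, Sequence

Similarity = Callable[[str, str], float]


def jaccard(a: str, b: str) -> float:
    """Stand-in lexical metric (assumption): word-set overlap in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def odd_one_out(triplet: Sequence[str], similarity: Similarity) -> int:
    """Return the index of the item left out of the most similar pair.

    Ties are broken by pair order: (0, 1), then (0, 2), then (1, 2).
    """
    best_pair = max(
        combinations(range(3), 2),
        key=lambda pair: similarity(triplet[pair[0]], triplet[pair[1]]),
    )
    return ({0, 1, 2} - set(best_pair)).pop()


def agreement(
    triplets: Sequence[Sequence[str]],
    expert_picks: Sequence[int],
    similarity: Similarity,
) -> float:
    """Fraction of triplets where the metric's pick matches the expert's."""
    hits = sum(
        odd_one_out(triplet, similarity) == pick
        for triplet, pick in zip(triplets, expert_picks)
    )
    return hits / len(triplets)


# Toy data, invented purely for illustration.
example_triplets = [
    (
        "the student explains the reasoning",
        "the student explains the reasoning clearly",
        "teacher asks a question",
    ),
]
example_expert_picks = [2]  # the expert judged the third annotation to be the odd one

score = agreement(example_triplets, example_expert_picks, jaccard)
```

Any function with the same `(str, str) -> float` shape, such as cosine similarity over embeddings, could be passed in place of `jaccard`, which makes it possible to compare several metrics against the same set of expert picks.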

