Fugu-MT Paper Translation (Abstract): Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation

Paper overview: Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation

arxiv url: http://arxiv.org/abs/2603.26898v1
Date: Fri, 27 Mar 2026 18:17:21 GMT
Status: translation complete
Last updated in system: 2026-03-31 23:18:44.686263
Title: Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation
Authors: Lorca McLaren, James Cross, Zuzanna Krakowska, Robin Rauner, Martijn Schoonvelde
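Abstract summary: Political scientists are rapidly adopting large language models (LLMs) for text annotation. How model choice, model size, learning approach, and prompt style interact, and whether popular "best practices" survive controlled comparison, are largely unexplored. The authors test six open-weight models across four political science annotation tasks under identical quantisation, hardware, and prompt-template conditions.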
Reference score (attention metric computed independently by the site): 1.5744532332166479
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Political scientists are rapidly adopting large language models (LLMs) for text annotation, yet the sensitivity of annotation results to implementation choices remains poorly understood. Most evaluations test a single model or configuration; how model choice, model size, learning approach, and prompt style interact, and whether popular "best practices" survive controlled comparison, are largely unexplored. We present a controlled evaluation of these pipeline choices, testing six open-weight models across four political science annotation tasks under identical quantisation, hardware, and prompt-template conditions. Our central finding is methodological: interaction effects dominate main effects, so seemingly reasonable pipeline choices can become consequential researcher degrees of freedom. No single model, prompt style, or learning approach is uniformly superior, and the best-performing model varies across tasks. Two corollaries follow. First, model size is an unreliable guide both to cost and to performance: cross-family efficiency differences are so large that some larger models are less resource-intensive than much smaller alternatives, while within model families mid-range variants often match or exceed larger counterparts. Second, widely recommended prompt engineering techniques yield inconsistent and sometimes negative effects on annotation performance. We use these benchmark results to develop a validation-first framework - with a principled ordering of pipeline decisions, guidance on prompt freezing and held-out evaluation, reporting standards, and open-source tools - to help researchers navigate this decision space transparently.
