Fugu-MT Paper Translation (Abstract): Computational Turing Test Reveals Systematic Differences Between Human and AI Language

Paper overview: Computational Turing Test Reveals Systematic Differences Between Human and AI Language

arxiv url: http://arxiv.org/abs/2511.04195v1
Date: Thu, 06 Nov 2025 08:56:37 GMT
Status: Translation complete
Last updated in system: 2025-11-07 20:17:53.373159
Title: Computational Turing Test Reveals Systematic Differences Between Human and AI Language
Title (reference translation): A computational Turing test reveals systematic differences between human and AI language
Authors: Nicolò Pagan, Petter Törnberg, Christopher A. Bail, Anikó Hannák, Christopher Barrie
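Abstract summary: Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior. Existing validation efforts rely heavily on human-judgment-based evaluations. This paper proposes a computational Turing test to assess how closely LLMs approximate human language.
Reference score (attention measure computed independently by this site): 0.0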
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations -- testing whether humans can distinguish AI from human output -- despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies -- including fine-tuning, stylistic prompting, and context retrieval -- benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations -- and offer a cautionary note about their current limitations in capturing human communication.
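Abstract (reference translation): Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, on the assumption that they can generate realistic, human-like text. Yet this assumption has gone largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations (testing whether humans can tell AI output from human output), despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, it introduces a computational Turing test: a validation framework that combines aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, it systematically compares nine open-weight LLMs across five calibration strategies, including fine-tuning, stylistic prompting, and context retrieval, benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. The findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not improve human-likeness. Crucially, there is a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations, and a cautionary note about their current limitations in capturing human communication.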
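
The two aggregate metrics named in the abstract, detectability and semantic similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract specifies a BERT-based detector, whereas the sketch below swaps in a bag-of-words nearest-centroid classifier and bag-of-words cosine similarity so that it runs with the Python standard library alone. All function names and the toy texts are hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of count vectors; cosine is scale-invariant, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def detectability(human_texts, ai_texts):
    """Leave-one-out accuracy of a nearest-centroid human-vs-AI classifier.

    Values near 0.5 mean the two sets are hard to tell apart; values near
    1.0 mean the AI text is easy to detect. (Assumption: the paper's
    detector is BERT-based; this classifier is only a stand-in.)
    """
    data = [(bag_of_words(t), "human") for t in human_texts]
    data += [(bag_of_words(t), "ai") for t in ai_texts]
    correct = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # hold out the item being classified
        cents = {
            lab: centroid(v for v, l in rest if l == lab)
            for lab in ("human", "ai")
        }
        pred = max(cents, key=lambda lab: cosine(vec, cents[lab]))
        correct += pred == label
    return correct / len(data)


def semantic_similarity(human_texts, ai_texts):
    """Mean cosine similarity between each AI reply and its paired human reply."""
    pairs = list(zip(human_texts, ai_texts))
    return sum(cosine(bag_of_words(h), bag_of_words(a)) for h, a in pairs) / len(pairs)


if __name__ == "__main__":
    # Invented toy replies, paired by position; not data from the paper.
    human = ["lol no way that ref was blind", "ugh monday again, send coffee"]
    ai = ["I believe the referee made a questionable decision.",
          "I understand that Mondays can be challenging."]
    print("detectability:", detectability(human, ai))
    print("semantic similarity:", semantic_similarity(human, ai))
```

Reporting the two numbers side by side is what makes the trade-off described in the abstract visible: a calibration strategy can lower detectability while also lowering similarity to the human reply it was meant to reproduce.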

Related papers