Fugu-MT Paper Translation (Summary): AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans

Paper summary: AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans

arxiv url: http://arxiv.org/abs/2509.16530v1
Date: Sat, 20 Sep 2025 04:40:31 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:15.839216
Title: AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans
Title (reference translation): AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans
Authors: Wei Xie, Shuoyoucheng Ma, Zhenhua Wang, Enze Wang, Kai Chen, Xiaobing Sun, Baosheng Wang
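Abstract summary: Large language models (LLMs) with hundreds of billions of parameters exhibit human-like intelligence by learning from vast amounts of internet-scale data. This paper introduces AIPsychoBench, a specialized benchmark for assessing the psychological properties of LLMs.
Reference score (independently computed attention score): 15.572185318032139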
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) with hundreds of billions of parameters have exhibited human-like intelligence by learning from vast amounts of internet-scale data. However, the uninterpretability of large-scale neural networks raises concerns about the reliability of LLM. Studies have attempted to assess the psychometric properties of LLMs by borrowing concepts from human psychology to enhance their interpretability, but they fail to account for the fundamental differences between LLMs and humans. This results in high rejection rates when human scales are reused directly. Furthermore, these scales do not support the measurement of LLM psychological property variations in different languages. This paper introduces AIPsychoBench, a specialized benchmark tailored to assess the psychological properties of LLM. It uses a lightweight role-playing prompt to bypass LLM alignment, improving the average effective response rate from 70.12% to 90.40%. Meanwhile, the average biases are only 3.3% (positive) and 2.1% (negative), which are significantly lower than the biases of 9.8% and 6.9%, respectively, caused by traditional jailbreak prompts. Furthermore, among the total of 112 psychometric subcategories, the score deviations for seven languages compared to English ranged from 5% to 20.2% in 43 subcategories, providing the first comprehensive evidence of the linguistic impact on the psychometrics of LLM.
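Abstract (reference translation): Large language models (LLMs) with hundreds of billions of parameters exhibit human-like intelligence by learning from vast amounts of internet-scale data. However, the uninterpretability of large-scale neural networks raises concerns about the reliability of LLMs. Studies have tried to assess the psychometric properties of LLMs by borrowing concepts from human psychology to improve interpretability, but they fail to account for the fundamental differences between LLMs and humans. As a result, rejection rates are high when human scales are reused directly. Moreover, these scales do not support measuring how the psychological properties of LLMs vary across languages. This paper introduces AIPsychoBench, a specialized benchmark for assessing the psychological properties of LLMs. It uses a lightweight role-playing prompt to bypass LLM alignment, improving the average effective response rate from 70.12% to 90.40%. Meanwhile, the average biases are only 3.3% (positive) and 2.1% (negative), considerably lower than the 9.8% and 6.9% biases caused by traditional jailbreak prompts. Furthermore, among a total of 112 psychometric subcategories, score deviations for seven languages relative to English ranged from 5% to 20.2% in 43 subcategories, providing the first comprehensive evidence of the linguistic impact on the psychometrics of LLMs.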
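The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the abstract does not define how the effective response rate or the per-language score deviation is computed, so the two helper functions (`effective_response_rate`, `relative_deviation`) and the toy data are assumptions, not the authors' method.

```python
def effective_response_rate(responses):
    """Percentage of responses that carry a usable answer (not None).

    Assumption: a refusal or unparseable answer is recorded as None.
    """
    if not responses:
        raise ValueError("responses must not be empty")
    valid = sum(1 for r in responses if r is not None)
    return 100.0 * valid / len(responses)


def relative_deviation(score, english_score):
    """Absolute deviation from the English baseline, as a percentage of it.

    Assumption: the paper may define deviation differently.
    """
    if english_score == 0:
        raise ValueError("english_score must be non-zero")
    return 100.0 * abs(score - english_score) / abs(english_score)


# Figures quoted in the abstract.
baseline_rate, role_play_rate = 70.12, 90.40
gain_points = round(role_play_rate - baseline_rate, 2)  # percentage points
share_affected = round(100.0 * 43 / 112, 1)  # % of subcategories in the 5%-20.2% range

# Toy data (hypothetical): Likert answers, with None marking a refusal.
toy_responses = [4, None, 5, 3, None, 2, 4, 5, 1, 3]
toy_rate = effective_response_rate(toy_responses)
toy_dev = relative_deviation(score=3.6, english_score=4.0)

print(f"gain: {gain_points} points, affected subcategories: {share_affected}%")
print(f"toy effective response rate: {toy_rate}%, toy deviation: {toy_dev:.1f}%")
```

Under these assumptions, the role-playing prompt raises the effective response rate by about 20 percentage points, and the 43 subcategories in the reported deviation range make up a little over a third of the 112.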

Related papers

Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding [12.703061322251093]
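Small Language Models (SLMs) are privacy-preserving alternatives to Large Language Models (LLMs). This paper examines the mental health understanding capabilities of current SLMs through a systematic evaluation of classification tasks. Our work highlights the potential of SLMs in mental health understanding and shows that they are effective privacy-preserving tools for analyzing sensitive online text data.
Paper reference translation (metadata) (2025-07-09T02:40:02Z)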
Cognitive phantoms in LLMs through the lens of latent variables [0.3441021278275805]
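Large language models (LLMs) increasingly reach real-world applications, which calls for a better understanding of their behavior. Recent psychometric studies of LLMs report human-like traits that could potentially influence their behavior. This approach suffers from a validity problem: it presupposes that these traits exist in LLMs and can be measured with tools designed for humans. This study examines the problem by comparing the latent personality structures of humans and three LLMs using two validated personality questionnaires.
Paper reference translation (metadata) (2024-09-06T12:42:35Z)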
Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis [4.59804401179409]
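We use six LLMs (GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, Cohere Command R Plus) to generate responses with psychometric properties similar to human answers. The results suggest that some LLMs have higher proficiency in College Algebra than college students.
Paper reference translation (metadata) (2024-07-15T16:49:26Z)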
Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
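We rigorously evaluate the implicit bias of large language models toward certain demographic groups. Inspired by psychometric principles, we propose three attack approaches: Disguise, Deception, and Teaching. The proposed methods elicit the internal bias of LLMs more effectively than competing baselines.
Paper reference translation (metadata) (2024-06-20T06:42:08Z)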
"Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 languages. This study measures LLM robustness with metrics including the hallucination rate.
Paper reference translation (metadata) (2023-12-18T17:18:04Z)
Psychometric Predictive Power of Large Language Models [32.31556074470733]
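Instruction tuning does not always make large language models human-like from a cognitive modeling perspective. Next-word probabilities estimated by instruction-tuned LLMs are often worse than those estimated by base LLMs at simulating human reading behavior.
Paper reference translation (metadata) (2023-11-13T17:19:14Z)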
Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
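We investigate the extent to which large language models (LLMs) reflect human response biases. Using survey questionnaires, we design a dataset and framework to evaluate whether LLMs exhibit human-like response biases. A comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
Paper reference translation (metadata) (2023-11-07T15:40:43Z)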

The related papers list is generated automatically from the titles and abstracts of papers on this site.

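This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).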