Fugu-MT Paper Translation (Summary): A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

Paper summary: A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

arxiv url: http://arxiv.org/abs/2510.12993v1
Date: Tue, 14 Oct 2025 21:10:50 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.424206
Title: A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation
Title (reference translation): A multilingual, large-scale study of the interplay between LLM safeguards, personalisation, and disinformation
Authors: João A. Leite, Arnav Arora, Silvia Gargova, João Luz, Gustavo Sampaio, Ian Roberts, Carolina Scarton, Kalina Bontcheva
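Abstract summary: This paper presents the first large-scale, multilingual empirical study of persona-targeted disinformation generation by large language models. AI-TRAITS is a new dataset of around 1.6 million texts generated by eight state-of-the-art LLMs. The findings show that even simple personalisation strategies in the prompts significantly increase the likelihood of jailbreaks for all studied LLMs.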
Reference score (independently computed attention score): 12.577461004484604
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The human-like proficiency of Large Language Models (LLMs) has brought concerns about their potential misuse for generating persuasive and personalised disinformation at scale. While prior work has demonstrated that LLMs can generate disinformation, specific questions around persuasiveness and personalisation (generation of disinformation tailored to specific demographic attributes) remain largely unstudied. This paper presents the first large-scale, multilingual empirical study on persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we systematically evaluate the robustness of LLM safety mechanisms to persona-targeted prompts. A key novel result is AI-TRAITS (AI-generaTed peRsonAlIsed disinformaTion dataSet), a new dataset of around 1.6 million texts generated by eight state-of-the-art LLMs. AI-TRAITS is seeded by prompts that combine 324 disinformation narratives and 150 distinct persona profiles, covering four major languages (English, Russian, Portuguese, Hindi) and key demographic dimensions (country, generation, political orientation). The resulting personalised narratives are then assessed quantitatively and compared along the dimensions of models, languages, jailbreaking rate, and personalisation attributes. Our findings demonstrate that the use of even simple personalisation strategies in the prompts significantly increases the likelihood of jailbreaks for all studied LLMs. Furthermore, personalised prompts result in altered linguistic and rhetorical patterns and amplify the persuasiveness of the LLM-generated false narratives. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.
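Abstract (reference translation): The human-like proficiency of Large Language Models (LLMs) has raised concerns about their potential misuse for generating persuasive, personalised disinformation at scale. Earlier work has shown that LLMs can generate disinformation, but specific questions about persuasiveness and personalisation (generating disinformation tailored to specific demographic attributes) have barely been studied. This paper reports the first large-scale, multilingual empirical study of persona-targeted disinformation generation by LLMs. Using a red-teaming methodology, the authors systematically evaluate how robust LLM safety mechanisms are to persona-targeted prompts. AI-TRAITS (AI-generaTed peRsonAlIsed disinformaTion dataSet) is a new dataset of around 1.6 million texts generated by eight state-of-the-art LLMs. It is seeded by prompts that combine 324 disinformation narratives with 150 distinct persona profiles, covering four major languages (English, Russian, Portuguese, Hindi) and key demographic dimensions (country, generation, political orientation). The resulting personalised narratives are assessed quantitatively and compared along the dimensions of model, language, jailbreak rate, and personalisation attribute. The findings show that even simple personalisation strategies in the prompts significantly increase the likelihood of jailbreaks for all studied LLMs. In addition, personalised prompts alter linguistic and rhetorical patterns and amplify the persuasiveness of the LLM-generated false narratives. These insights expose critical vulnerabilities in current state-of-the-art LLMs and provide a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.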
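Note on dataset size: the abstract's figures can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope check, not the authors' construction procedure. It assumes a full cross product in which every narrative, persona, language, and model combination yields exactly one text; the abstract does not state this, and the function and constant names are hypothetical.

```python
# Back-of-the-envelope size check for AI-TRAITS.
# ASSUMPTION (not stated in the abstract): a full cross product of
# narratives x personas x languages x models, one text per combination.

NUM_NARRATIVES = 324   # disinformation narratives (from the abstract)
NUM_PERSONAS = 150     # distinct persona profiles (from the abstract)
NUM_LANGUAGES = 4      # English, Russian, Portuguese, Hindi
NUM_MODELS = 8         # state-of-the-art LLMs studied


def estimated_text_count(narratives: int, personas: int,
                         languages: int, models: int) -> int:
    """Return the number of texts if each combination yields one generation."""
    return narratives * personas * languages * models


total = estimated_text_count(NUM_NARRATIVES, NUM_PERSONAS,
                             NUM_LANGUAGES, NUM_MODELS)
print(f"{total:,}")  # 1,555,200
```

Under that assumption the product is 1,555,200, which is consistent with the "around 1.6 million texts" reported in the abstract.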

Related papers