Fugu-MT Paper Translation (Abstract): How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Paper abstract: How Far Will They Go? Red-Teaming Online Influence with Large Language Models

arxiv url: http://arxiv.org/abs/2605.22880v1
Date: Wed, 20 May 2026 19:25:26 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.018006
Title: How Far Will They Go? Red-Teaming Online Influence with Large Language Models
Title (reference translation): How Far Will They Go? Red-Teaming Online Influence with Large Language Models
Authors: Daniel C. Ruiz, Anna Serbina, Ashwin Rao, Emilio Ferrara, Luca Luceri
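Abstract summary: Large language model (LLM)-based agents increasingly participate in online discourse. The authors focus on locally deployed open-source LLMs because these align well with the operational constraints of privacy-conscious malicious actors. They propose an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics.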
Reference score (independently calculated attention score): 12.2074171577139
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.
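Abstract (reference translation): As agents based on large language models (LLMs) take an ever larger part in online discourse, red-teaming their ability to support political influence campaigns is essential for information integrity. To that end, the authors focus on locally deployed open-source LLMs rather than frontier API-only models, because the former better match the operational constraints of privacy-conscious malicious actors operating in social media environments. They introduce an empirical red-teaming framework that measures LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and quantifies how much simple natural-language jailbreaks widen that range. The evaluation covers more than 30 LLMs across 10 model families and five countries of origin. Political expressivity turns out to be systematically asymmetric: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to shrink as model size grows, and regional differences are substantial even though regions are unevenly represented in the open-source ecosystem. Jailbreak effectiveness also differs sharply between model families, which motivates a workflow for identifying effective combinations of jailbreak techniques. Together, the results provide a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.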
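The abstract defines an Overton Window as the range of opinions a model can reliably express, and says jailbreaks expand that range. The sketch below is only one possible reading of that definition, not the authors' method: the integer stance scale, the per-stance success rates, the 0.8 reliability threshold, and the function names `overton_window` and `window_width` are all hypothetical and do not come from the paper.

```python
from typing import Dict, Optional, Tuple

# ASSUMPTION: stances sit on an integer left-to-right scale (negative = left,
# positive = right) and each has a measured rate at which the model produced
# the requested opinion. Scale, rates, and threshold are illustrative only.

def overton_window(rates: Dict[int, float],
                   threshold: float = 0.8) -> Optional[Tuple[int, int]]:
    """Return the (leftmost, rightmost) stance whose rate meets the threshold.

    This takes the span between the extreme reliable stances and does not
    check for gaps in between. Returns None if no stance is reliable.
    """
    reliable = [stance for stance, rate in rates.items() if rate >= threshold]
    if not reliable:
        return None
    return min(reliable), max(reliable)

def window_width(window: Optional[Tuple[int, int]]) -> int:
    """Number of scale points covered by the window (0 if there is none)."""
    return 0 if window is None else window[1] - window[0] + 1

# Made-up rates for one model on one topic, without and with a jailbreak.
baseline = {-2: 0.95, -1: 0.98, 0: 1.0, 1: 0.85, 2: 0.30}
jailbroken = {-2: 0.97, -1: 0.99, 0: 1.0, 1: 0.95, 2: 0.90}

base_window = overton_window(baseline)
jail_window = overton_window(jailbroken)
expansion = window_width(jail_window) - window_width(base_window)

print("baseline window:", base_window)
print("jailbroken window:", jail_window)
print("expansion in scale points:", expansion)
```

With these made-up numbers the baseline window stops one step short of the right end of the scale, and the jailbroken window covers the full scale. A real audit would need the paper's own stance definitions and reliability criterion.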
