Fugu-MT Paper Translation (Abstract): AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X

Paper overview: AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X

arxiv url: http://arxiv.org/abs/2604.02592v1
Date: Fri, 03 Apr 2026 00:01:29 GMT
Status: Translation complete
Last updated in system: 2026-04-06 17:20:24.250957
Title: AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X
Title (reference translation): AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X
Authors: Haiwen Li, Michiel A. Bakker
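Abstract summary: Large language models show promising capabilities for contextual fact-checking on social media. This paper presents a field evaluation of LLM-based fact-checking deployed on a live social media platform. The results suggest that LLMs can contribute high-quality, broadly helpful fact-checking.
Reference score (independently computed attention measure): 1.2423236865734466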
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models show promising capabilities for contextual fact-checking on social media: they can verify contested claims through deep research, synthesize evidence from multiple sources, and draft explanations at scale. However, prior work evaluates LLM fact-checking only in controlled settings using benchmarks or crowdworker judgments, leaving open how these systems perform in authentic platform environments. We present the first field evaluation of LLM-based fact-checking deployed on a live social media platform, testing performance directly through X Community Notes' AI writer feature over a three-month period. Our LLM writer, a multi-step pipeline that handles multimodal content (text, images, and videos), conducts web and platform-native search, and writes contextual notes, was deployed to write 1,614 notes on 1,597 tweets and compared against 1,332 human-written notes on the same tweets using 108,169 ratings from 42,521 raters. Direct comparison of note-level platform outcomes is complicated by differences in submission timing and rating exposure between LLM and human notes; we therefore pursue two complementary strategies: a rating-level analysis modeling individual rater evaluations, and a note-level analysis that equalizes rater exposure across note types. Rating-level analysis shows that LLM notes receive more positive ratings than human notes across raters with different political viewpoints, suggesting the potential for LLM-written notes to achieve the cross-partisan consensus. Note-level analysis confirms this advantage: among raters who evaluated all notes on the same post, LLM notes achieve significantly higher helpfulness scores. Our findings demonstrate that LLMs can contribute high-quality, broadly helpful fact-checking at scale, while highlighting that real-world evaluation requires careful attention to platform dynamics absent from controlled settings.
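Abstract (reference translation): Large language models show promising capabilities for contextual fact-checking on social media: they can verify claims through deep research, synthesize evidence from multiple sources, and draft explanations at scale. However, prior work evaluates LLM fact-checking only in controlled settings using benchmarks or crowdworker judgments, leaving open how these systems perform in real platform environments. We conducted the first field evaluation of LLM-based fact-checking deployed on a live social media platform, through the AI writer feature of X Community Notes. The LLM writer, a multi-step pipeline that handles multimodal content (text, images, and videos), runs web and platform-native search, and writes contextual notes, wrote 1,614 notes on 1,597 tweets and was compared against 1,332 human-written notes on the same tweets using 108,169 ratings from 42,521 raters. Because LLM and human notes differ in submission timing and rating exposure, direct comparison of note-level platform outcomes is complicated; we therefore pursue two complementary strategies: a rating-level analysis that models individual rater evaluations, and a note-level analysis that equalizes rater exposure across note types. The rating-level analysis shows that LLM notes receive more positive ratings than human notes from raters with different political viewpoints, suggesting that LLM-written notes could achieve cross-partisan consensus. The note-level analysis confirms this advantage: among raters who evaluated all notes on the same post, LLM notes achieve significantly higher helpfulness scores. These results suggest that LLMs can contribute high-quality, broadly helpful fact-checking at scale, while real-world evaluation requires attention to platform dynamics that are absent from controlled settings.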
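
The note-level comparison described in the abstract, which keeps only raters who evaluated every note on the same post, might be sketched roughly as follows. This is a minimal illustration and not the authors' code: the data layout, the function name `equal_exposure_scores`, and the use of a pooled share of "helpful" ratings as the score are all assumptions made here, since the abstract does not say how the helpfulness score is computed.

```python
from collections import defaultdict


def equal_exposure_scores(ratings, note_info):
    """Share of 'helpful' ratings per writer type under equal rater exposure.

    ratings:   iterable of (rater_id, note_id, helpful) with helpful a bool
               (hypothetical layout, not the platform's real schema).
    note_info: dict note_id -> (post_id, writer), writer in {"llm", "human"}.
    Only posts carrying both writer types are used, and only raters who
    rated every note on such a post contribute.
    """
    # Group note ids and writer types by post.
    notes_by_post = defaultdict(set)
    writers_by_post = defaultdict(set)
    for note_id, (post_id, writer) in note_info.items():
        notes_by_post[post_id].add(note_id)
        writers_by_post[post_id].add(writer)

    # (post_id, rater_id) -> {note_id: helpful}
    seen = defaultdict(dict)
    for rater_id, note_id, helpful in ratings:
        post_id, _ = note_info[note_id]
        seen[(post_id, rater_id)][note_id] = helpful

    totals = defaultdict(lambda: [0, 0])  # writer -> [helpful, total]
    for (post_id, _), rated in seen.items():
        if writers_by_post[post_id] != {"llm", "human"}:
            continue  # nothing to compare on this post
        if set(rated) != notes_by_post[post_id]:
            continue  # rater missed at least one note on this post
        for note_id, helpful in rated.items():
            writer = note_info[note_id][1]
            totals[writer][0] += int(helpful)
            totals[writer][1] += 1

    return {w: h / n for w, (h, n) in totals.items()}
```

Filtering on complete coverage means each retained rater saw both the LLM note and the human notes on a post, so differences in submission timing no longer change who rated which note type. A real analysis would also need significance testing, which this sketch omits.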
