Fugu-MT Paper Translation (Abstract): Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

Paper overview: Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

arxiv url: http://arxiv.org/abs/2605.21338v1
Date: Wed, 20 May 2026 16:05:05 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.769386
Title: Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media
Title (reference translation): Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media
Authors: Yuefeng Shi, Nedjma Ousidhoum, Jose Camacho-Collados
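Abstract summary: The paper introduces a question-based evaluation framework. The benchmark is applied to diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. The results show that performance depends heavily on input scale and on the complexity of the data sources.
Reference score (independently computed attention score): 4.065252374657746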
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark on diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that the performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly Open-weights models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.
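Abstract (reference translation): LLMs have shown exceptional proficiency across a wide range of NLP tasks. However, a notable gap remains when LLMs must process long sequences of unstructured documents such as news feeds and, in particular, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework consisting of 470 manually curated questions that evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We applied the benchmark to diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. The results show that performance depends heavily on input scale and the complexity of the data sources, and drops noticeably in multi-label or target-dependent scenarios. Moreover, as task complexity increases, performance declines progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, once the input size exceeds 500 instances, a limitation common to LLMs, particularly open-weights models, is identified. These findings highlight critical architectural bottlenecks in current LLMs for rigorous quantitative analysis over large text collections.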
