Fugu-MT Paper Translation (Abstract): A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition

Paper overview: A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition

arxiv url: http://arxiv.org/abs/2604.18944v1
Date: Tue, 21 Apr 2026 00:46:48 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.555093
Title: A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition
Title (reference translation): A Mechanism and Optimization Study on the Impact of Information Density on Named Entity Recognition in User-Generated Content
Authors: Jiang Xiaobo, Dinghong Lai, Song Qiu, Yadong Deng, Xinkai Zhan
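Abstract summary: The paper shows that surface-level noise symptoms share a unified root cause, low Information Density (ID). It introduces Attention Spectrum Analysis (ASA) to quantify how reduced ID leads to attention blunting. The proposed Window-Aware Optimization Module (WOM) is an LLM-empowered, model-agnostic framework.
Reference score (independently computed attention measure): 1.005854289245731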
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Named Entity Recognition (NER) models trained on clean, high-resource corpora exhibit catastrophic performance collapse when deployed on noisy, sparse User-Generated Content (UGC), such as social media. Prior research has predominantly focused on point-wise symptom remediation, employing customized fine-tuning to address issues like neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. However, these improvements often fail to generalize because they overlook the structural sparsity inherent in UGC. This study reveals that surface-level noise symptoms share a unified root cause: low Information Density (ID). Through hierarchical confounding-controlled resampling experiments (specifically controlling for entity rarity and annotation consistency), this paper identifies ID as an independent key factor. We introduce Attention Spectrum Analysis (ASA) to quantify how reduced ID causally leads to "attention blunting," ultimately degrading NER performance. Informed by these mechanistic insights, we propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic framework. WOM identifies information-sparse regions and utilizes selective back-translation to directionally enhance semantic density without altering model architecture. Deployed atop mainstream architectures on standard UGC datasets (WNUT2017, Twitter-NER, WNUT2016), WOM yields up to 4.5% absolute F1 improvement, demonstrating robustness and achieving new state-of-the-art (SOTA) results on WNUT2017.
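Abstract (reference translation): Named entity recognition (NER) models trained on clean, high-resource corpora show a catastrophic performance collapse when deployed on noisy, sparse user-generated content (UGC) such as social media. Prior work has concentrated on point-wise symptom remediation, using customized fine-tuning to address problems such as neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. These improvements often fail to generalize, however, because they overlook the structural sparsity inherent in UGC. This study shows that the surface-level noise symptoms share a single root cause: low information density (ID). Through hierarchical confounding-controlled resampling experiments (controlling in particular for entity rarity and annotation consistency), the paper identifies ID as an independent key factor. The authors introduce Attention Spectrum Analysis (ASA) to quantify how reduced ID causally leads to "attention blunting" and ultimately degrades NER performance. Building on these mechanistic insights, they propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic framework. WOM identifies information-sparse regions and uses selective back-translation to raise semantic density in a targeted way, without changing the model architecture. Deployed on top of mainstream architectures on standard UGC datasets (WNUT2017, Twitter-NER, WNUT2016), WOM delivers up to 4.5% absolute F1 improvement, demonstrates robustness, and achieves new state-of-the-art (SOTA) results on WNUT2017.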
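
Illustrative sketch (not the authors' implementation): the abstract says WOM identifies information-sparse regions and selectively rewrites them, but it does not define how ID is computed or how windows are chosen. The Python below only illustrates that general shape under loud assumptions: the density proxy (share of non-filler tokens), the `FILLER` set, the fixed window size, the threshold, and the `rewrite` callable standing in for LLM back-translation are all hypothetical.

```python
from typing import Callable, List, Tuple

# ASSUMPTION: a toy filler vocabulary; the paper's ID measure is not given in the abstract.
FILLER = {"the", "a", "an", "lol", "omg", "so", "like", "um", "uh", "rt"}


def window_density(tokens: List[str]) -> float:
    """Hypothetical ID proxy: fraction of tokens that are not filler words."""
    if not tokens:
        return 0.0
    informative = sum(1 for t in tokens if t.lower() not in FILLER)
    return informative / len(tokens)


def sparse_windows(
    tokens: List[str], size: int = 4, threshold: float = 0.5
) -> List[Tuple[int, int, float]]:
    """Return (start, end, density) for each fixed-size window below the threshold."""
    spans = []
    for start in range(0, len(tokens), size):
        window = tokens[start:start + size]
        density = window_density(window)
        if density < threshold:
            spans.append((start, start + len(window), density))
    return spans


def enhance(
    tokens: List[str],
    rewrite: Callable[[List[str]], List[str]],
    size: int = 4,
    threshold: float = 0.5,
) -> List[str]:
    """Rewrite only the sparse windows; dense windows pass through unchanged.

    `rewrite` is a placeholder for a back-translation step (hypothetical here).
    """
    out: List[str] = []
    for start in range(0, len(tokens), size):
        window = tokens[start:start + size]
        if window_density(window) < threshold:
            out.extend(rewrite(window))
        else:
            out.extend(window)
    return out


if __name__ == "__main__":
    post = "omg lol so like Justin Bieber visits Toronto".split()
    print(sparse_windows(post))
    print(enhance(post, rewrite=lambda w: ["<rewritten>"]))
```

Rewriting only the low-density windows leaves entity-bearing spans untouched, which is the selective behaviour the abstract describes. A real system would also have to realign NER labels after rewriting, which this sketch does not attempt.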

Related papers