Fugu-MT Paper Translation (Abstract): E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition

Paper overview: E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition

arxiv url: http://arxiv.org/abs/2604.17319v1
Date: Sun, 19 Apr 2026 08:18:06 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.461923
Title: E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition
Authors: Meng Zhang, Jinzhong Ning, Xiaolong Wu, Hongfei Lin, Yijia Zhang
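Abstract summary: Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to a visual region in the associated image. E2E-GMNER is a fully end-to-end generative framework that unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning.
Reference score (independently computed attention score): 33.81090014865745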
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to a corresponding visual region in an associated image. Existing approaches predominantly adopt pipeline-based architectures that decouple textual entity recognition and visual grounding, leading to error accumulation and suboptimal joint optimization. In this paper, we propose E2E-GMNER, a fully end-to-end generative framework that unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. We formulate GMNER as an instruction-tuned conditional generation task and incorporate chain-of-thought reasoning to enable the model to adaptively determine when visual evidence or background knowledge is informative, reducing reliance on noisy cues. To further address the instability of generative bounding box prediction, we introduce Gaussian Risk-Aware Box Perturbation (GRBP), which replaces hard box supervision with probabilistically perturbed soft targets to improve robustness against annotation noise and discretization errors. Extensive experiments on the Twitter-GMNER and Twitter-FMNERG benchmarks demonstrate that E2E-GMNER achieves highly competitive performance compared with state-of-the-art methods, validating the effectiveness of unified end-to-end optimization and noise-aware grounding supervision. Code is available at: https://github.com/Finch-coder/E2E-GMNER
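
The abstract describes GRBP only at a high level (hard box targets replaced by probabilistically perturbed ones), so the following is a minimal sketch of the general idea rather than the authors' method. It assumes pixel-space boxes `(x1, y1, x2, y2)`, independent Gaussian jitter per coordinate scaled by box size, and coordinates quantized into 1000 integer bins for generation. The names `perturb_box`, `quantize_box`, `sigma_ratio`, and `bins` are hypothetical, and the sketch does not attempt to model the "risk-aware" component.

```python
import random


def perturb_box(box, image_size, sigma_ratio=0.02, rng=None):
    """Return a copy of `box` with Gaussian jitter added to each coordinate.

    box: (x1, y1, x2, y2) in pixels; image_size: (width, height).
    sigma_ratio is a hypothetical knob: noise std as a fraction of box size.
    """
    rng = rng or random.Random()
    x1, y1, x2, y2 = box
    img_w, img_h = image_size
    sigma_x = sigma_ratio * (x2 - x1)
    sigma_y = sigma_ratio * (y2 - y1)

    # Jitter each coordinate independently (an assumption of this sketch).
    nx1 = x1 + rng.gauss(0.0, sigma_x)
    ny1 = y1 + rng.gauss(0.0, sigma_y)
    nx2 = x2 + rng.gauss(0.0, sigma_x)
    ny2 = y2 + rng.gauss(0.0, sigma_y)

    # Clamp to the image and keep the corners ordered.
    nx1, nx2 = sorted((min(max(nx1, 0.0), img_w), min(max(nx2, 0.0), img_w)))
    ny1, ny2 = sorted((min(max(ny1, 0.0), img_h), min(max(ny2, 0.0), img_h)))
    return (nx1, ny1, nx2, ny2)


def quantize_box(box, image_size, bins=1000):
    """Map pixel coordinates to integer tokens in [0, bins - 1]."""
    img_w, img_h = image_size
    x1, y1, x2, y2 = box

    def to_bin(value, size):
        return min(bins - 1, int(value / size * bins))

    return (to_bin(x1, img_w), to_bin(y1, img_h),
            to_bin(x2, img_w), to_bin(y2, img_h))


if __name__ == "__main__":
    rng = random.Random(0)
    gold = (10, 20, 110, 220)
    noisy = perturb_box(gold, (640, 480), sigma_ratio=0.05, rng=rng)
    print(quantize_box(gold, (640, 480)), quantize_box(noisy, (640, 480)))
```

Under these assumptions, a fresh perturbed target would be drawn each time a training example is seen, so the model is never pushed toward a single exact coordinate token. Scaling the noise by box size keeps the jitter proportionally similar for small and large regions.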

