Fugu-MT Paper Translation (Abstract): SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

Paper summary: SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

arxiv url: http://arxiv.org/abs/2604.20146v1
Date: Wed, 22 Apr 2026 03:17:36 GMT
Status: Translation complete
Last updated in system: 2026-04-23 15:36:10.947954
Title: SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
Authors: Jielong Tang, Xujie Yuan, Jiayang Liu, Jianxing Yu, Xiao Dong, Lin Chen, Yunlai Teng, Shimin Di, Jian Yin
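Abstract summary: Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs. On open-world social media platforms, GMNER remains challenging because entities are long-tailed, rapidly evolving, and often unseen. This work proposes SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation with external knowledge exploration.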
Reference score (independently computed attention metric): 28.17858615204594
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.
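The abstract names two mechanisms without giving formulas: entity-level uncertainty estimated from multiple forward samplings, and a hybrid reward that penalizes unnecessary retrieval. The following is a minimal sketch of how such signals could be computed, not the authors' implementation. The agreement-based uncertainty measure, the `threshold` and `penalty` values, the `(span, type)` entity representation, and all function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical entity representation: (span text, entity type).
Entity = Tuple[str, str]


def entity_agreement(samples: List[List[Entity]]) -> Dict[Entity, float]:
    """Fraction of sampled outputs in which each predicted entity appears."""
    counts: Counter = Counter()
    for prediction in samples:
        counts.update(set(prediction))  # count each entity once per sample
    n = len(samples)
    return {entity: c / n for entity, c in counts.items()}


def search_tags(samples: List[List[Entity]], threshold: float = 0.6) -> Dict[Entity, bool]:
    """Tag an entity for search when its agreement falls below the threshold.

    Low agreement across samples is treated here as a knowledge-gap signal;
    the threshold value is an assumption, not taken from the paper.
    """
    return {e: a < threshold for e, a in entity_agreement(samples).items()}


def hybrid_reward(task_score: float, searched: bool, search_needed: bool,
                  penalty: float = 0.2) -> float:
    """Task score minus a penalty when retrieval was invoked but not needed."""
    unnecessary = searched and not search_needed
    return task_score - (penalty if unnecessary else 0.0)


# Toy data: four sampled predictions for one image-text pair.
samples = [
    [("Alice", "PER")],
    [("Alice", "PER"), ("Acme", "ORG")],
    [("Alice", "PER")],
    [("Alice", "PER"), ("Acme", "LOC")],
]
tags = search_tags(samples)
```

In this toy run, the entity predicted consistently across all samples is not tagged for search, while the two conflicting readings of the second span are. A penalty of this shape only discourages retrieval when the model could have answered from its own knowledge, which matches the stated goal of deciding when retrieval is truly necessary.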


Related papers