Fugu-MT Paper Translation (Abstract): Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement

Paper overview: Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement

arxiv url: http://arxiv.org/abs/2603.20781v1
Date: Sat, 21 Mar 2026 12:16:03 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.080416
Title: Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement
Title (reference translation): Code-MIE: A Code-Style Model for Multimodal Information Extraction Enhanced with Scene Graph and Entity Attribute Knowledge
Authors: Jiang Liu, Ge Qiu, Hao Fei, Dongdong Xie, Jinbo Li, Fei Li, Chong Teng, Donghong Ji
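Abstract summary: We propose a Code-style Multimodal Information Extraction framework (Code-MIE). Code-MIE formalizes MIE as unified code understanding and generation. The proposed method achieves state-of-the-art performance compared to six competing baseline models.
Reference score (independently computed attention score): 32.720833540821125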
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rapid development of large language models (LLMs), more and more researchers have paid attention to information extraction based on LLMs. However, existing methods still leave room for improvement. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which do not match the characteristics of information extraction tasks, whose targets are mostly structured information such as entities and relations. Second, although a few methods have adopted structured, more IE-friendly code-style templates, they have been explored only on text-only IE rather than multimodal IE. Moreover, these methods are complex in design, requiring a separate template for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE), which formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender and affiliation are extracted from the text to guide the model in understanding the context and role of entities. (2) Images are converted into scene graphs and visual features to incorporate rich visual information into the model. (3) The input template is constructed as a Python function whose parameters are the entity attributes, scene graphs, and raw text. In contrast, the output template is formalized as Python dictionaries containing all extraction results, such as entities and relations. To evaluate Code-MIE, we conducted extensive experiments on the M$^3$D, Twitter-15, Twitter-17, and MNRE datasets. The results show that our method achieves state-of-the-art performance compared to six competing baseline models, with 61.03% and 60.49% on the English and Chinese datasets of M$^3$D, and 76.04%, 88.07%, and 73.94% on the other three datasets.
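Abstract (reference translation): With the rapid development of large language models (LLMs), a growing number of researchers are turning their attention to LLM-based information extraction. However, existing related methods still have room for improvement. First, existing multimodal information extraction (MIE) methods usually use natural language templates as the input and output of LLMs, which is a mismatch with the characteristics of information extraction tasks that mostly involve structured information such as entities and relations. Second, although a few methods have adopted structured, more IE-friendly code-style templates, they were explored only on text-only IE, not multimodal IE. In addition, those methods are complex in design and require a separate template to be designed for each task. This paper proposes a Code-style Multimodal Information Extraction framework (Code-MIE) that formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender and affiliation are extracted from the text to guide the model in understanding the context and role of entities. (2) Images are converted into scene graphs and visual features so that rich visual information is incorporated into the model. (3) The input template is constructed as a Python function whose parameters are the entity attributes, scene graphs, and raw text. The output template, in contrast, is formalized as Python dictionaries containing all extraction results, such as entities and relations. To evaluate Code-MIE, extensive experiments were conducted on the M$^3$D, Twitter-15, Twitter-17, and MNRE datasets. The results show that the method achieves state-of-the-art performance compared to six competing baseline models, with 61.03% and 60.49% on the English and Chinese datasets of M$^3$D, and 76.04%, 88.07%, and 73.94% on the other three datasets.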
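Illustrative sketch (not from the paper): the abstract describes an input template built as a Python function whose parameters are the entity attributes, scene graph, and raw text, and an output template expressed as a Python dictionary. The snippet below shows one way such a template pair could look. Every name in it (`build_prompt`, `parse_output`, `multimodal_information_extraction`, the dictionary keys, and the sample data) is a hypothetical placeholder, the visual features mentioned in the abstract are omitted, and the "model completion" is a hand-written string standing in for real LLM output.

```python
import ast


def build_prompt(text, entity_attributes, scene_graph):
    """Render a code-style prompt: a Python function header whose parameters
    carry the raw text, entity attributes, and scene graph.
    The layout is an assumption, not the paper's actual template."""
    return (
        "def multimodal_information_extraction(\n"
        f"    text={text!r},\n"
        f"    entity_attributes={entity_attributes!r},\n"
        f"    scene_graph={scene_graph!r},\n"
        "):\n"
        '    """Extract entities and relations from the text and image."""\n'
        "    result = "
    )


def parse_output(completion):
    """Parse a completion that is expected to be a Python dict literal."""
    result = ast.literal_eval(completion.strip())
    if not isinstance(result, dict):
        raise ValueError("expected a dict literal")
    return result


# Made-up example inputs (not taken from any of the datasets in the abstract).
prompt = build_prompt(
    text="Alice joined Acme Corp in Paris.",
    entity_attributes={"Alice": {"gender": "female", "affiliation": "Acme Corp"}},
    scene_graph=[("woman", "standing in front of", "building")],
)

# Hand-written stand-in for what a model might generate after "result = ".
completion = (
    '{"entities": [{"span": "Alice", "type": "PER"}], '
    '"relations": [{"head": "Alice", "type": "member_of", "tail": "Acme Corp"}]}'
)

extraction = parse_output(completion)
```

Because the prompt ends at `result = `, the prompt and the completion together form a syntactically valid Python function, which is the property that lets a single template cover several extraction tasks in this sketch. Using `ast.literal_eval` rather than `eval` keeps parsing restricted to literals.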


Related papers