Fugu-MT Paper Translation (Abstract): WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

Paper overview: WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

arxiv url: http://arxiv.org/abs/2603.09921v1
Date: Tue, 10 Mar 2026 17:18:53 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.493872
Title: WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
Authors: Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He
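Abstract summary: Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER perform strongly but at high computational cost. The paper introduces WikiCLIP, a framework that establishes a strong, efficient baseline for open-domain VER.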
Reference score (attention score computed by this site): 18.56932287056642
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
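The contrastive paradigm the abstract refers to can be illustrated with a minimal sketch: an image embedding is compared against a bank of entity embeddings, and the entity with the highest cosine similarity is returned. The code below is a generic illustration of that idea and of an InfoNCE-style training loss, not the authors' implementation. The function names, the toy 3-dimensional embeddings, the entity names, and the temperature value of 0.07 are all assumptions made for this example; WikiCLIP's actual encoders, its Vision-Guided Knowledge Adaptor, and its Hard Negative Synthesis Mechanism are not reproduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(image_emb, entity_embs):
    """Cosine similarity between one image embedding and every entity embedding."""
    img = l2_normalize(image_emb)
    return [sum(a * b for a, b in zip(img, l2_normalize(e))) for e in entity_embs]


def rank_entities(image_emb, entity_embs, entity_names):
    """Return entity names sorted from most to least similar to the image."""
    scores = cosine_scores(image_emb, entity_embs)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [entity_names[i] for i in order]


def info_nce_loss(image_emb, entity_embs, positive_index, temperature=0.07):
    """InfoNCE loss for one image: cross-entropy over temperature-scaled similarities."""
    logits = [s / temperature for s in cosine_scores(image_emb, entity_embs)]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[positive_index]


# Toy, made-up embeddings (hypothetical; real ones would come from trained encoders).
# "Brooklyn Bridge" plays the role of a hard negative: close to the positive, but wrong.
entity_names = ["Golden Gate Bridge", "Brooklyn Bridge", "Eiffel Tower"]
entity_embs = [[1.0, 0.1, 0.0], [0.9, 0.3, 0.0], [0.0, 0.1, 1.0]]
image_emb = [1.0, 0.0, 0.0]

ranking = rank_entities(image_emb, entity_embs, entity_names)
loss_correct = info_nce_loss(image_emb, entity_embs, positive_index=0)
loss_wrong = info_nce_loss(image_emb, entity_embs, positive_index=2)
print(ranking)
print(loss_correct, loss_wrong)
```

In a retrieval setup like this, entity embeddings can be computed once and cached, so inference reduces to one image encoding plus a similarity search. That is one plausible reason a contrastive model could be much faster than a generative one, although the abstract itself does not break down where its reported latency reduction comes from.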
