A Generative Approach for Wikipedia-Scale Visual Entity Recognition
- URL: http://arxiv.org/abs/2403.02041v2
- Date: Thu, 21 Mar 2024 14:59:13 GMT
- Title: A Generative Approach for Wikipedia-Scale Visual Entity Recognition
- Authors: Mathilde Caron, Ahmet Iscen, Alireza Fathi, Cordelia Schmid
- Abstract summary: We address the task of mapping a given query image to one of the 6 million existing entities in Wikipedia.
We introduce a novel Generative Entity Recognition framework, which learns to auto-regressively decode a semantic and discriminative "code" identifying the target entity.
- Score: 56.55633052479446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is using dual-encoder models (e.g., CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively, it is also possible to re-purpose a captioning model to directly generate the entity names for a given image. In contrast, we introduce a novel Generative Entity Recognition (GER) framework, which, given an input image, learns to auto-regressively decode a semantic and discriminative "code" identifying the target entity. Our experiments demonstrate the efficacy of this GER paradigm, showcasing state-of-the-art performance on the challenging OVEN benchmark. GER surpasses strong captioning, dual-encoder, visual matching and hierarchical classification baselines, affirming its advantage in tackling the complexities of web-scale recognition.
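As a concrete illustration of the decoding scheme the abstract describes, here is a minimal sketch of trie-constrained autoregressive decoding over entity codes. Everything in it (the toy code_table, the fake_logits stand-in for an image-conditioned decoder) is a hypothetical assumption, not the paper's implementation: the actual GER system learns semantic codes for roughly 6 million Wikipedia entities and decodes them with a trained model.

```python
# Minimal sketch of generative entity recognition via constrained
# autoregressive decoding. All names here (code_table, fake_logits) are
# hypothetical stand-ins, not the paper's implementation.

# Toy table mapping each entity to its discriminative code (a token sequence).
code_table = {
    "Eiffel Tower": (3, 1, 4),
    "Tokyo Tower":  (3, 1, 5),
    "Golden Gate":  (2, 7, 0),
}

def build_trie(codes):
    """Build a prefix trie so decoding can only follow valid entity codes."""
    trie = {}
    for seq in codes:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def decode_entity(logits_fn, trie, max_len=8):
    """Greedy, trie-constrained decode: at each step, keep only tokens
    that extend some valid entity code, then take the most likely one."""
    prefix, node = [], trie
    for _ in range(max_len):
        if not node:                               # leaf: a complete code
            break
        logits = logits_fn(prefix)                 # image-conditioned scores
        allowed = {t: logits[t] for t in node}     # mask invalid tokens
        tok = max(allowed, key=allowed.get)
        prefix.append(tok)
        node = node[tok]
    return tuple(prefix)

def fake_logits(prefix):
    """Stand-in for the image-conditioned decoder: fixed fake scores."""
    return {t: -abs(t - 3) - 0.1 * len(prefix) for t in range(10)}

trie = build_trie(code_table.values())
code = decode_entity(fake_logits, trie)
entity = next(name for name, c in code_table.items() if c == code)
print(code, "->", entity)
```

In the real system the per-step scores come from a trained transformer conditioned on the query image; the trie constraint guarantees that every decoded sequence corresponds to an existing entity.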
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Revisit Anything: Visual Place Recognition via Image Segment Retrieval [8.544326445217369]
Existing visual place recognition pipelines encode the "whole" image and search for matches.
We address this by encoding and searching for "image segments" instead of the whole images.
We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole image based retrieval.
arXiv Detail & Related papers (2024-09-26T16:49:58Z) - ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling [32.55352435358949]
We propose a sentence generation-based retrieval formulation for attribute recognition.
For each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence.
We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets.
arXiv Detail & Related papers (2024-08-07T21:44:29Z) - Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation [90.71613903956451]
Text-to-image retrieval is a fundamental task in multimedia processing.
We propose an autoregressive voken generation method, named AVG.
We show that AVG achieves superior results in both effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-24T13:39:51Z) - Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities [54.26896306906937]
We present OVEN-Wiki, where a model needs to link an image to a Wikipedia entity with respect to a text query.
We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning.
While PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.
arXiv Detail & Related papers (2023-02-22T05:31:26Z) - An Effective Automatic Image Annotation Model Via Attention Model and Data Equilibrium [0.0]
The proposed model has three phases, including a feature extractor, a tag generator, and an image annotator.
The experiments conducted on two benchmark datasets confirm the superiority of the proposed model over previous models.
arXiv Detail & Related papers (2020-01-26T05:59:57Z)
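For contrast with the generative decoding sketched above, the dual-encoder alternative mentioned in the main abstract (and underlying several CLIP-based papers in this list) reduces recognition to nearest-neighbor search in a shared embedding space. The sketch below uses random stand-in embeddings as an assumption; real systems use trained CLIP-style encoders and an approximate index over millions of entity names.

```python
# Minimal sketch of the dual-encoder baseline: embed entity names and a
# query image into one space, then retrieve by (approximate) k-NN.
# The random vectors below are stand-ins for real CLIP-style encoders.
import numpy as np

rng = np.random.default_rng(0)
entity_names = ["Eiffel Tower", "Tokyo Tower", "Golden Gate"]

# Stand-in text embeddings, L2-normalized so dot product = cosine similarity.
entity_emb = rng.normal(size=(len(entity_names), 64))
entity_emb /= np.linalg.norm(entity_emb, axis=1, keepdims=True)

# Stand-in image embedding: here, a noisy copy of one entity's embedding.
query = entity_emb[0] + 0.1 * rng.normal(size=64)
query /= np.linalg.norm(query)

# Exact k-NN by cosine similarity; at 6M entities this would be replaced
# by an approximate index (e.g. FAISS) for speed.
scores = entity_emb @ query
topk = np.argsort(-scores)[:3]
for i in topk:
    print(f"{entity_names[i]}: {scores[i]:.3f}")
```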