Open-domain Visual Entity Recognition: Towards Recognizing Millions of
Wikipedia Entities
- URL: http://arxiv.org/abs/2302.11154v2
- Date: Fri, 24 Feb 2023 00:50:25 GMT
- Title: Open-domain Visual Entity Recognition: Towards Recognizing Millions of
Wikipedia Entities
- Authors: Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi,
Kenton Lee, Kristina Toutanova, Ming-Wei Chang
- Abstract summary: We present OVEN-Wiki, where a model needs to link an image to a Wikipedia entity with respect to a text query.
We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning.
While PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.
- Score: 54.26896306906937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit
strong generalization on various visual domains and tasks. However, existing
image classification benchmarks often evaluate recognition on a specific domain
(e.g., outdoor images) or a specific task (e.g., classifying plant species),
which falls short of evaluating whether pre-trained foundational models are
universal visual recognizers. To address this, we formally present the task of
Open-domain Visual Entity recognitioN (OVEN), where a model needs to link an
image to a Wikipedia entity with respect to a text query. We construct
OVEN-Wiki by re-purposing 14 existing datasets, grounding all labels in a
single label space: Wikipedia entities. OVEN challenges models to select
among six million possible Wikipedia entities, making it a general visual
recognition benchmark with the largest number of labels. Our study on
state-of-the-art pre-trained models reveals large headroom in generalizing to
the massive-scale label space. We show that a PaLI-based auto-regressive visual
recognition model performs surprisingly well, even on Wikipedia entities that
have never been seen during fine-tuning. We also find that existing pre-trained
models exhibit different strengths: while PaLI-based models obtain higher overall
performance, CLIP-based models are better at recognizing tail entities.
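To make the contrast in the abstract concrete, the following is a minimal, hypothetical sketch of how a CLIP-style dual-encoder baseline could score a query image against candidate Wikipedia entity names. The file name, text query, and three-entity candidate list are placeholders, and the actual OVEN baselines are fine-tuned and search a precomputed index over roughly six million entity embeddings rather than a handful of names:

    # Minimal sketch (not the paper's exact baseline): score a query image against
    # a tiny set of candidate Wikipedia entity names with an off-the-shelf CLIP model.
    # "query.jpg" and the candidate list are placeholders; the real OVEN label space
    # holds ~6 million entities and would be searched with a precomputed text index.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("query.jpg")                   # hypothetical input image
    text_query = "What is the breed of this dog?"     # OVEN-style intent query
    candidates = ["Golden Retriever", "Labrador Retriever", "Border Collie"]

    # One text per candidate; prepending the query is a simplification of how a
    # dual-encoder baseline could condition retrieval on the textual intent.
    texts = [f"{text_query} {name}" for name in candidates]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image     # shape: (1, num_candidates)

    print(candidates[logits.argmax(dim=-1).item()])

A PaLI-style model would instead generate the entity name token by token conditioned on the image and the text query rather than ranking a fixed index, which matches the abstract's finding that the generative route scores higher overall while the retrieval route recognizes tail entities better.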
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models [13.972809192907931]
Foundation models (FMs) are large neural networks trained on broad datasets.
Human activity recognition in video has advanced with FMs, driven by competition among different architectures.
This paper empirically evaluates how perspective changes affect different FMs in fine-grained human activity recognition.
arXiv Detail & Related papers (2024-07-22T12:59:57Z)
- A Generative Approach for Wikipedia-Scale Visual Entity Recognition [56.55633052479446]
We address the task of mapping a given query image to one of the 6 million existing entities in Wikipedia.
We introduce a novel Generative Entity Recognition framework, which learns to auto-regressively decode a semantic and discriminative "code" identifying the target entity.
arXiv Detail & Related papers (2024-03-04T13:47:30Z)
- Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction [17.989559761931435]
We propose a novel "Fine-grained Visual-Semantic Interaction" framework for WSI classification.
It is designed to enhance the model's generalizability by leveraging the interaction between localized visual patterns and fine-grained pathological semantics.
Our method demonstrates robust generalizability and strong transferability, substantially outperforming competing methods on the TCGA Lung Cancer dataset.
arXiv Detail & Related papers (2024-02-29T16:29:53Z)
- Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
- Towards Open-Ended Visual Recognition with Large Language Model [27.56182473356992]
We introduce the OmniScient Model (OSM), a novel Large Language Model (LLM) based mask classifier.
OSM predicts class labels in a generative manner, thus removing the need to supply class names during both training and testing.
It also enables cross-dataset training without any human interference.
arXiv Detail & Related papers (2023-11-14T18:59:01Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Geometric Perception based Efficient Text Recognition [0.0]
In real-world applications with fixed camera positions, the underlying data tends to be regular scene text.
This paper introduces the underlying concepts, theory, implementation, and experiment results to develop specialized models.
We introduce a novel deep learning architecture (GeoTRNet), trained to identify digits in a regular scene image, only using the geometrical features present.
arXiv Detail & Related papers (2023-02-08T04:19:24Z)
- Look-into-Object: Self-supervised Structure Modeling for Object Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions.
We show the recognition backbone can be substantially enhanced for more robust representation learning.
Our approach achieves large performance gains on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft).
arXiv Detail & Related papers (2020-03-31T12:22:51Z)