Related papers: Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

URL: http://arxiv.org/abs/2410.23676v1
Date: Thu, 31 Oct 2024 06:55:24 GMT
Title: Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach
Authors: Mathilde Caron, Alireza Fathi, Cordelia Schmid, Ahmet Iscen,
Abstract summary: Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data. We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
Score: 56.55633052479446
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded finegrained textual description (referred to as "rationale") that explains the connection between images and their assigned entities. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks (e.g. +6.9% improvement in OVEN entity task), underscoring the importance of high-quality training data in this domain.

Related papers

Improving Large Vision-Language Models' Understanding for Field Data [62.917026891829025]
We introduce FieldLVLM, a framework designed to improve large vision-language models' understanding of field data.<n>FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning.<n> Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data.
arXiv Detail & Related papers (2025-07-24T11:28:53Z)
Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation [25.283739839182147]
We show that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks.<n>We propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information into CoT data.<n>We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats.
arXiv Detail & Related papers (2025-07-03T17:59:29Z)
Knowledge Graphs for Enhancing Large Language Models in Entity Disambiguation [0.061446808540639365]
We use Knowledge Graphs to enhance Large Language Models (LLMs) for zero-shot Entity Disambiguation (ED)<n>We leverage the hierarchical representation of the entities' classes in a KG to prune the candidate space as well as the entities' descriptions to enrich the input prompt with additional factual knowledge.<n>Our evaluation on popular ED datasets shows that the proposed method outperforms non-enhanced and description-only enhanced LLMs, and has a higher degree of adaptability than task-specific models.
arXiv Detail & Related papers (2025-05-05T15:40:24Z)
ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes. ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning [1.6570772838074355]
multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA) Recent efforts primarily focus on scaling up training datasets through data collection and synthesis. We propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development.
arXiv Detail & Related papers (2024-07-29T17:04:34Z)
Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags [28.368960723666458]
Multimodal Large Language Models (MLLMs) struggle with critical problems when required to provide a precise and detailed response to a visual instruction. We show effectiveness in mitigating these issues, but at an expensive cost of collecting a vast amount of new data. We propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information.
arXiv Detail & Related papers (2024-06-16T08:20:12Z)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference for image descriptions using unlabeled images. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification. An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images. We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities. The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories. We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks. We exploit to train a more effective cross-modal model which is adaptively capable of incorporating key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance. Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models. In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.