Visual Named Entity Linking: A New Dataset and A Baseline
- URL: http://arxiv.org/abs/2211.04872v1
- Date: Wed, 9 Nov 2022 13:27:50 GMT
- Title: Visual Named Entity Linking: A New Dataset and A Baseline
- Authors: Wenxiang Sun, Yixing Fan, Jiafeng Guo, Ruqing Zhang, Xueqi Cheng
- Abstract summary: We consider a purely Visual-based Named Entity Linking (VNEL) task, where the input only consists of an image.
We propose three different sub-tasks, i.e., visual to visual entity linking (V2VEL), visual to textual entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL)
We present a high-quality human-annotated visual person linking dataset, named WIKIPerson.
- Score: 61.38231023490981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Entity Linking (VEL) is a task to link regions of images with their
corresponding entities in Knowledge Bases (KBs), which is beneficial for many
computer vision tasks such as image retrieval, image captioning, and visual
question answering. Existing tasks in VEL, however, either rely on textual data
to complement multi-modal linking or link objects only to general entities, and
thus fail to perform named entity linking on large amounts of image data. In
this paper, we consider a purely Visual-based Named Entity Linking (VNEL) task,
where the input only consists of an image. The task is to identify objects of
interest (i.e., visual entity mentions) in images and link them to
corresponding named entities in KBs. Since each entity often contains rich
visual and textual information in KBs, we thus propose three different
sub-tasks, i.e., visual to visual entity linking (V2VEL), visual to textual
entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL).
In addition, we present a high-quality human-annotated visual person linking
dataset, named WIKIPerson. Based on WIKIPerson, we establish a series of
baseline algorithms for the solution of each sub-task, and conduct experiments
to verify the quality of proposed datasets and the effectiveness of baseline
methods. We envision this work to be helpful in soliciting more work on VNEL in
the future. The code and datasets are publicly available at
https://github.com/ict-bigdatalab/VNEL.
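The three sub-tasks differ only in which side of the entity's KB representation is matched against the visual mention. A minimal retrieval-style sketch in Python, assuming precomputed embeddings; the entity names, vectors, and fusion weight below are hypothetical illustrations, not the paper's actual baselines:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def link(mention_vec, entities, mode="v2vtel", alpha=0.5):
    """Rank KB entities against a visual mention embedding.

    mode: 'v2vel'  -> match against the entity's image embedding,
          'v2tel'  -> match against the entity's text embedding,
          'v2vtel' -> late-fuse both similarities with weight alpha.
    """
    scored = []
    for name, img_vec, txt_vec in entities:
        if mode == "v2vel":
            s = cosine(mention_vec, img_vec)
        elif mode == "v2tel":
            s = cosine(mention_vec, txt_vec)
        else:  # v2vtel: weighted combination of both similarities
            s = (alpha * cosine(mention_vec, img_vec)
                 + (1 - alpha) * cosine(mention_vec, txt_vec))
        scored.append((s, name))
    return max(scored)[1]  # entity with the highest score

# Toy KB: (entity name, image embedding, text embedding) -- made up.
entities = [
    ("Alan_Turing", [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    ("Ada_Lovelace", [0.1, 0.9, 0.2], [0.0, 0.9, 0.3]),
]
mention = [0.85, 0.15, 0.05]  # embedding of a detected face region
print(link(mention, entities, mode="v2vel"))  # prints "Alan_Turing"
```

The late fusion in `v2vtel` is only one way to combine the two channels; early fusion of the raw embeddings is an equally plausible design under this sketch.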
Related papers
- DWE+: Dual-Way Matching Enhanced Framework for Multimodal Entity Linking [16.728006492769666]
We propose DWE+ for multimodal entity linking.
DWE+ could capture finer semantics and dynamically maintain semantic consistency with entities.
Experiments on Wikimel, Richpedia, and Wikidiverse datasets demonstrate the effectiveness of DWE+ in improving MEL performance.
arXiv Detail & Related papers (2024-04-07T05:56:42Z)
- SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM [48.15067480282839]
This work introduces a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric VQA.
The dataset is organized into 22 major categories, containing 7,568 unique entities in total.
Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BLEURT score.
arXiv Detail & Related papers (2024-03-07T18:38:17Z)
- Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! [103.09776737512077]
SelTDA (Self-Taught Data Augmentation) is a strategy for finetuning large vision language models on small-scale VQA datasets.
It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images.
We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
arXiv Detail & Related papers (2023-06-06T18:00:47Z)
- Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models [31.865208971014336]
We propose a task to verify how knowledge about entities acquired from natural language is retained in Vision & Language (V&L) models.
This task consists of two parts: the first is to generate a table containing knowledge about an entity and its related image, and the second is to generate an image from an entity with a caption.
We created the Wikipedia Table and Image Generation (WikiTIG) dataset from about 200,000 infoboxes in English Wikipedia articles to perform the proposed tasks.
arXiv Detail & Related papers (2023-06-03T14:01:54Z)
- One-shot Scene Graph Generation [130.57405850346836]
We propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task.
Our method significantly outperforms existing state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-02-22T11:32:59Z)
- Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph [96.95815946327079]
It is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities.
We propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities.
arXiv Detail & Related papers (2021-07-26T05:50:41Z)
- Multimodal Entity Linking for Tweets [6.439761523935613]
Multimodal entity linking (MEL) is an emerging research field in which textual and visual information is used to map an ambiguous mention to an entity in a knowledge base (KB).
We propose a method for building a fully annotated Twitter dataset for MEL, where entities are defined in a Twitter KB.
Then, we propose a model for jointly learning a representation of both mentions and entities from their textual and visual contexts.
arXiv Detail & Related papers (2021-04-07T16:40:23Z)
- Visual Pivoting for (Unsupervised) Entity Alignment [93.82387952905756]
This work studies the use of visual semantic representations to align entities in heterogeneous knowledge graphs (KGs).
We show that the proposed new approach, EVA, creates a holistic entity representation that provides strong signals for cross-graph entity alignment.
arXiv Detail & Related papers (2020-09-28T20:09:40Z)
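Visual pivoting for alignment is commonly approximated as nearest-neighbour matching between image-derived entity embeddings of the two graphs. A minimal mutual-nearest-neighbour sketch in Python; the embeddings and entity identifiers are made up for illustration, and this is not EVA's actual algorithm:

```python
def mutual_nn_alignment(emb_a, emb_b):
    """Align entities of KG A to KG B: keep a pair only when each side
    is the other's nearest neighbour by dot-product similarity."""
    def nearest(vec, pool):
        # Return the key in `pool` whose vector has the highest dot product.
        return max(pool, key=lambda k: sum(x * y for x, y in zip(vec, pool[k])))

    pairs = []
    for a, va in emb_a.items():
        b = nearest(va, emb_b)             # A -> B nearest neighbour
        if nearest(emb_b[b], emb_a) == a:  # require B -> A to agree
            pairs.append((a, b))
    return pairs

# Toy image-derived embeddings for two small KGs (hypothetical values).
kg_a = {"Q1": [1.0, 0.0], "Q2": [0.0, 1.0]}
kg_b = {"E1": [0.9, 0.1], "E2": [0.1, 0.9]}
print(mutual_nn_alignment(kg_a, kg_b))  # prints [('Q1', 'E1'), ('Q2', 'E2')]
```

The mutual-agreement check is what makes this usable without supervision: one-sided nearest neighbours produce many spurious pairs, while mutual pairs are a far higher-precision seed set.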
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.