Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph
Propagation
- URL: http://arxiv.org/abs/2306.08487v2
- Date: Wed, 21 Jun 2023 01:42:17 GMT
- Title: Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph
Propagation
- Authors: Likang Wu, Zhi Li, Hongke Zhao, Zhefeng Wang, Qi Liu, Baoxing Huai,
Nicholas Jing Yuan, Enhong Chen
- Abstract summary: We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
- Score: 68.13453771001522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-Shot Learning (ZSL), which aims at automatically recognizing unseen
objects, is a promising learning paradigm that lets machines continuously acquire
new real-world knowledge. Recently, the Knowledge Graph (KG) has been proven an
effective scheme for handling the zero-shot task with large-scale, non-attribute
data. Prior studies typically embed the relationships between seen and unseen
objects, drawn from existing knowledge graphs, into visual information to improve
recognition of unseen data. In fact, real-world knowledge is naturally formed by
multimodal facts. Compared with ordinary structural knowledge from a graph
perspective, a multimodal KG can provide cognitive systems with fine-grained
knowledge. For example, text descriptions and visual content can depict more
critical details of a fact than knowledge triplets alone. Unfortunately, this
multimodal fine-grained knowledge remains largely unexploited due to the
bottleneck of feature alignment between different modalities. To that end, we
propose a multimodal intensive ZSL framework that matches regions of images with
corresponding semantic embeddings via a designed dense attention module and a
self-calibration loss. This lets the semantic transfer process of our ZSL
framework learn more differentiated knowledge between entities, and frees our
model from the performance limitation of relying only on coarse global features.
We conduct extensive experiments and evaluate our model on large-scale real-world
data. The experimental results clearly demonstrate the effectiveness of the
proposed model in standard zero-shot classification tasks.
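
The abstract names the key components of the model (a dense attention module that matches image regions with class semantic embeddings, and a self-calibration loss) without giving details. The snippet below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation; every dimension, layer choice, and the exact form of the calibration term are assumptions.

```python
# Hypothetical sketch of region-to-semantic dense attention with a
# self-calibration-style regularizer; shapes and loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseRegionAttention(nn.Module):
    def __init__(self, region_dim=2048, sem_dim=300, hidden_dim=512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, hidden_dim)  # project region features
        self.proj_sem = nn.Linear(sem_dim, hidden_dim)        # project class semantic embeddings

    def forward(self, regions, semantics):
        # regions:   (B, R, region_dim) local region features per image
        # semantics: (C, sem_dim)       one semantic embedding per class
        q = self.proj_region(regions)                                  # (B, R, H)
        k = self.proj_sem(semantics)                                   # (C, H)
        att = torch.einsum("brh,ch->brc", q, k) / q.size(-1) ** 0.5    # dense region-class scores
        weights = att.softmax(dim=1)                                   # attend over regions per class
        return (att * weights).sum(dim=1)                              # (B, C) compatibility logits


def self_calibration_loss(logits, seen_labels, seen_mask, lam=0.1):
    # Cross-entropy over seen classes plus a term that keeps some probability
    # mass on unseen classes, discouraging over-confidence on seen ones.
    ce = F.cross_entropy(logits[:, seen_mask], seen_labels)
    probs = logits.softmax(dim=-1)
    unseen_mass = probs[:, ~seen_mask].sum(dim=-1)
    return ce + lam * (-torch.log(unseen_mass + 1e-8)).mean()


# Shape-only example: 4 images with 36 regions each, 100 classes (60 seen).
model = DenseRegionAttention()
logits = model(torch.randn(4, 36, 2048), torch.randn(100, 300))
seen_mask = torch.arange(100) < 60
loss = self_calibration_loss(logits, torch.randint(0, 60, (4,)), seen_mask)
```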
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [11.752632557524969]
We propose contrastive learning with data augmentation to disentangle content features from the original representations.
Our experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks.
arXiv Detail & Related papers (2023-11-28T03:00:59Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Knowledge Graph Augmented Network Towards Multiview Representation Learning for Aspect-based Sentiment Analysis [96.53859361560505]
We propose a knowledge graph augmented network (KGAN) to incorporate external knowledge together with explicit syntactic and contextual information.
KGAN captures the sentiment feature representations from multiple perspectives, i.e., context-, syntax- and knowledge-based.
Experiments on three popular ABSA benchmarks demonstrate the effectiveness and robustness of our KGAN.
arXiv Detail & Related papers (2022-01-13T08:25:53Z)
- Generalized Zero-Shot Learning using Multimodal Variational Auto-Encoder with Semantic Concepts [0.9054540533394924]
Recent techniques try to learn a cross-modal mapping between the semantic space and the image space.
We propose a Multimodal Variational Auto-Encoder (M-VAE) which can learn the shared latent space of image features and the semantic space.
Our results show that our proposed model outperforms the current state-of-the-art approaches for generalized zero-shot learning.
arXiv Detail & Related papers (2021-06-26T20:08:37Z)
- Towards a Universal Continuous Knowledge Base [49.95342223987143]
We propose a method for building a continuous knowledge base that can store knowledge imported from multiple neural networks.
We import the knowledge from multiple models into the knowledge base, from which the fused knowledge is exported back to a single model.
Experiments on text classification show promising results.
arXiv Detail & Related papers (2020-12-25T12:27:44Z)
- All About Knowledge Graphs for Actions [82.39684757372075]
We propose a better understanding of knowledge graphs (KGs) that can be utilized for zero-shot and few-shot action recognition.
We study three different construction mechanisms for KGs: action embeddings, action-object embeddings, visual embeddings.
We present extensive analysis of the impact of different KGs on different experimental setups.
arXiv Detail & Related papers (2020-08-28T01:44:01Z)
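
Both the entry above ("All About Knowledge Graphs for Actions") and the main abstract build on the standard recipe of propagating class semantics over a knowledge graph to obtain classifiers for unseen classes. For orientation, here is a minimal, hypothetical sketch of that generic recipe (a small GCN mapping class word embeddings to visual classifier weights); the graph, dimensions, and training target are assumptions and do not come from any of the listed papers.

```python
# Hypothetical sketch of the generic KG-propagation recipe for zero-shot
# recognition: a small GCN maps class semantic embeddings to visual
# classifier weights; all shapes and design choices here are assumptions.
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        # adj_norm: (C, C) normalized class adjacency matrix (with self-loops)
        return self.linear(adj_norm @ x)


class KGPropagation(nn.Module):
    def __init__(self, sem_dim=300, hidden_dim=512, visual_dim=2048):
        super().__init__()
        self.gc1 = GCNLayer(sem_dim, hidden_dim)
        self.gc2 = GCNLayer(hidden_dim, visual_dim)

    def forward(self, semantics, adj_norm):
        # semantics: (C, sem_dim) embeddings for all classes, seen and unseen
        h = torch.relu(self.gc1(semantics, adj_norm))
        return self.gc2(h, adj_norm)  # (C, visual_dim) predicted classifier weights


# Training typically regresses the predicted weights of seen classes onto a
# pretrained CNN's classifier weights; unseen-class classifiers then come for
# free from graph propagation at test time.
```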