Progressive Tree-Structured Prototype Network for End-to-End Image
Captioning
- URL: http://arxiv.org/abs/2211.09460v1
- Date: Thu, 17 Nov 2022 11:04:00 GMT
- Title: Progressive Tree-Structured Prototype Network for End-to-End Image
Captioning
- Authors: Pengpeng Zeng, Jinkuan Zhu, Jingkuan Song, Lianli Gao
- Abstract summary: We propose a novel Progressive Tree-Structured prototype Network (dubbed PTSN)
PTSN is the first attempt to narrow down the scope of prediction words with appropriate semantics by modeling the hierarchical textual semantics.
Our method achieves new state-of-the-art performance with 144.2% (single model) and 146.5% (ensemble of 4 models) CIDEr scores on the Karpathy split and 141.4% (c5) and 143.9% (c40) CIDEr scores on the official online test server.
- Score: 74.8547752611337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Studies of image captioning are shifting towards a trend of a fully
end-to-end paradigm by leveraging powerful visual pre-trained models and
transformer-based generation architecture for more flexible model training and
faster inference speed. State-of-the-art approaches simply extract isolated
concepts or attributes to assist description generation. However, such
approaches do not consider the hierarchical semantic structure in the textual
domain, which leads to an unpredictable mapping between visual representations
and concept words. To this end, we propose a novel Progressive Tree-Structured
prototype Network (dubbed PTSN), which is the first attempt to narrow down the
scope of predicted words to those with appropriate semantics by modeling
hierarchical textual semantics. Specifically, we design a novel embedding
method called tree-structured prototype, producing a set of hierarchical
representative embeddings which capture the hierarchical semantic structure in
textual space. To incorporate such tree-structured prototypes into visual
cognition, we also propose a progressive aggregation module that exploits
semantic relationships between the image and the prototypes. By applying our
PTSN to the end-to-end captioning framework, extensive experiments conducted
on the MSCOCO dataset show that our method achieves new state-of-the-art
performance with 144.2% (single model) and 146.5% (ensemble of 4 models) CIDEr
scores on the Karpathy split and 141.4% (c5) and 143.9% (c40) CIDEr scores on the official
online test server. Trained models and source code have been released at:
https://github.com/NovaMind-Z/PTSN.
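As a rough illustration of the two ideas named in the abstract (a hierarchy of textual prototypes, and progressive aggregation of those prototypes into visual features), the following PyTorch sketch clusters caption-word embeddings into coarse-to-fine prototype sets and folds each level into grid features with cross-attention. Every name, shape, and design choice here (k-means levels, attention-based aggregation) is an assumption for exposition only; the released implementation at https://github.com/NovaMind-Z/PTSN is authoritative.

```python
# Illustrative sketch only, not the released PTSN implementation.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


def build_tree_prototypes(word_embeddings, levels=(8, 64, 512)):
    """Cluster caption-word embeddings at several granularities (coarse -> fine)."""
    feats = word_embeddings.detach().cpu().numpy()
    prototypes = []
    for k in levels:  # one prototype set per tree level
        centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats).cluster_centers_
        prototypes.append(torch.tensor(centers, dtype=torch.float32))
    return prototypes


class ProgressiveAggregation(nn.Module):
    """Fold coarse-to-fine prototypes into visual grid features with cross-attention."""

    def __init__(self, dim, num_heads=8, num_levels=3):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_levels)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_levels))

    def forward(self, visual_tokens, prototypes):
        # visual_tokens: (batch, num_patches, dim) from a visual backbone (e.g. a ViT)
        x = visual_tokens
        for attn, norm, protos in zip(self.attns, self.norms, prototypes):
            keys = protos.unsqueeze(0).expand(x.size(0), -1, -1)  # (batch, k, dim)
            attended, _ = attn(query=x, key=keys, value=keys)
            x = norm(x + attended)  # residual update, refined one level at a time
        return x  # semantically enriched features for the caption decoder


if __name__ == "__main__":
    dim = 512
    word_emb = torch.randn(1000, dim)                 # stand-in caption-word embeddings
    protos = build_tree_prototypes(word_emb, levels=(8, 32, 128))
    model = ProgressiveAggregation(dim, num_heads=8, num_levels=3)
    visual = torch.randn(2, 196, dim)                 # two images, 14x14 grid features
    print(model(visual, protos).shape)                # torch.Size([2, 196, 512])
```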
Related papers
- Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs).
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
- Emergent Visual-Semantic Hierarchies in Image-Text Representations [13.300199242824934]
We study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies.
We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding.
arXiv Detail & Related papers (2024-07-11T14:09:42Z)
- PRIOR: Prototype Representation Joint Learning from Medical Images and Reports [19.336988866061294]
We present a prototype representation learning framework incorporating both global and local alignment between medical images and reports.
In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation.
A sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual and high-level clinical linguistic features.
arXiv Detail & Related papers (2023-07-24T07:49:01Z)
- Prototype-based Embedding Network for Scene Graph Generation [105.97836135784794]
Current Scene Graph Generation (SGG) methods explore contextual information to predict relationships among entity pairs.
Due to the diverse visual appearance of numerous possible subject-object combinations, there is a large intra-class variation within each predicate category.
Prototype-based Embedding Network (PE-Net) models entities/predicates with prototype-aligned compact and distinctive representations.
Prototype-guided Learning (PL) is introduced to help PE-Net efficiently learn such entity-predicate matching, and Prototype Regularization (PR) is devised to relieve ambiguous entity-predicate matching.
arXiv Detail & Related papers (2023-03-13T13:30:59Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs) approach.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- HCSC: Hierarchical Contrastive Selective Coding [44.655310210531226]
Hierarchical Contrastive Selective Coding (HCSC) is a novel contrastive learning framework.
We introduce an elaborate pair selection scheme to make image representations better fit semantic structures.
We verify the superior performance of HCSC over state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-02-01T15:04:40Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- SCNet: Enhancing Few-Shot Semantic Segmentation by Self-Contrastive Background Prototypes [56.387647750094466]
Few-shot semantic segmentation aims to segment novel-class objects in a query image with only a few annotated examples.
Most advanced solutions exploit a metric learning framework that performs segmentation by matching each pixel to a learned foreground prototype (a minimal sketch of this matching step appears after this list).
This framework suffers from biased classification due to the incomplete construction of sample pairs, which use the foreground prototype only.
arXiv Detail & Related papers (2021-04-19T11:21:47Z)
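To make the pixel-to-prototype matching mentioned in the SCNet summary concrete, here is a minimal PyTorch sketch of the common masked-average-pooling prototype and cosine-similarity matching. Function names, the threshold, and the pooling formulation are assumptions illustrating the generic metric-learning baseline, not SCNet's self-contrastive background prototypes.

```python
# Illustrative sketch of prototype matching for few-shot segmentation (not SCNet).
import torch
import torch.nn.functional as F


def foreground_prototype(support_feats, support_mask):
    """Masked average pooling: average support features over annotated foreground pixels.

    support_feats: (C, H, W) backbone features of the support image.
    support_mask:  (H, W) binary foreground mask.
    Returns a (C,) prototype vector.
    """
    mask = support_mask.float().unsqueeze(0)             # (1, H, W)
    pooled = (support_feats * mask).sum(dim=(1, 2))      # sum over foreground pixels
    return pooled / mask.sum().clamp(min=1e-6)


def segment_by_prototype(query_feats, prototype, threshold=0.5):
    """Label each query pixel by cosine similarity to the foreground prototype."""
    sims = F.cosine_similarity(query_feats, prototype[:, None, None], dim=0)  # (H, W)
    return (sims > threshold).long(), sims


if __name__ == "__main__":
    support = torch.randn(256, 32, 32)                   # support-image features
    mask = (torch.rand(32, 32) > 0.7).long()             # toy foreground mask
    query = torch.randn(256, 32, 32)                     # query-image features
    proto = foreground_prototype(support, mask)
    pred, sims = segment_by_prototype(query, proto)
    print(pred.shape, sims.shape)                        # torch.Size([32, 32]) twice
```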