CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation
- URL: http://arxiv.org/abs/2407.11393v2
- Date: Wed, 17 Jul 2024 16:40:05 GMT
- Title: CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation
- Authors: Kalliopi Basioti, Mohamed A. Abdelsalam, Federico Fancellu, Vladimir Pavlovic, Afsaneh Fazly
- Abstract summary: We propose a novel, fully automatic method to sample additional focused and visually grounded captions.
We leverage Abstract Meaning Representation (AMR) to encode all possible spatio-semantic relations between entities.
We then develop a new model, CIC-BART-SSA, that sources its control signals from SSA-diversified datasets.
- Score: 9.493755431645313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controllable Image Captioning (CIC) aims at generating natural language descriptions for an image, conditioned on information provided by end users, e.g., regions, entities or events of interest. However, available image-language datasets mainly contain captions that describe the entirety of an image, making them ineffective for training CIC models that can potentially attend to any subset of regions or relationships. To tackle this challenge, we propose a novel, fully automatic method to sample additional focused and visually grounded captions using a unified structured semantic representation built on top of the existing set of captions associated with an image. We leverage Abstract Meaning Representation (AMR), a cross-lingual graph-based semantic formalism, to encode all possible spatio-semantic relations between entities, beyond the typical spatial-relations-only focus of current methods. We use this Structured Semantic Augmentation (SSA) framework to augment existing image-caption datasets with the grounded controlled captions, increasing their spatial and semantic diversity and focal coverage. We then develop a new model, CIC-BART-SSA, specifically tailored for the CIC task, that sources its control signals from SSA-diversified datasets. We empirically show that, compared to SOTA CIC models, CIC-BART-SSA generates captions that are superior in diversity and text quality, are competitive in controllability, and, importantly, minimize the gap between broad and highly focused controlled captioning performance by efficiently generalizing to the challenging highly focused scenarios. Code is available at https://github.com/SamsungLabs/CIC-BART-SSA.
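To make the SSA idea concrete, here is a minimal sketch using the open-source amrlib and penman libraries (not the authors' released code; the model choices and the core-role subgraph heuristic are illustrative assumptions): parse a caption to AMR, keep an event-centred subgraph, and regenerate a focused caption from it.

```python
# Illustrative SSA-style pipeline: caption -> AMR -> focused subgraph -> caption.
# Requires `pip install amrlib penman` plus downloaded amrlib models.
import amrlib
import penman

stog = amrlib.load_stog_model()   # sentence-to-graph (AMR parser)
gtos = amrlib.load_gtos_model()   # graph-to-sentence (AMR generator)

caption = "A man in a red shirt throws a frisbee to his dog in a park."
graph_str = stog.parse_sents([caption])[0]
g = penman.decode(graph_str)

# Toy focusing heuristic: keep only triples reachable from the root event
# through core roles, dropping peripheral modifiers such as :mod or :location.
core_roles = {":instance", ":ARG0", ":ARG1", ":ARG2"}
keep = {g.top}
sub_triples = []
for src, role, tgt in g.triples:   # penman emits triples in tree order
    if src in keep and role in core_roles:
        sub_triples.append((src, role, tgt))
        keep.add(tgt)

focused_graph = penman.encode(penman.Graph(sub_triples, top=g.top))
sentences, _ = gtos.generate([focused_graph])
print(sentences[0])   # e.g. a shorter caption focused on the throwing event
```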
Related papers
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the image address localization (IAL) problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z) - Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [33.69721994194684]
We propose a novel framework for open-vocabulary semantic segmentation called EBSeg.
AdaB Decoder is designed to generate different image embeddings for both training and new classes.
SSC Loss aligns the inter-classes affinity in the image feature space with that in the text feature space of CLIP.
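A minimal sketch of what such an affinity-alignment loss could look like, based only on this one-line summary (the actual SSC Loss in EBSeg may differ): match the inter-class cosine affinities of image embeddings to those of CLIP text embeddings.

```python
import torch
import torch.nn.functional as F

def ssc_loss(img_cls_emb: torch.Tensor, txt_cls_emb: torch.Tensor) -> torch.Tensor:
    """img_cls_emb, txt_cls_emb: (num_classes, dim) per-class embeddings."""
    img = F.normalize(img_cls_emb, dim=-1)
    txt = F.normalize(txt_cls_emb, dim=-1)
    aff_img = img @ img.t()   # (C, C) inter-class affinity in image space
    aff_txt = txt @ txt.t()   # (C, C) inter-class affinity in CLIP text space
    return F.l1_loss(aff_img, aff_txt)

# Usage with random stand-in features for 8 classes:
loss = ssc_loss(torch.randn(8, 512), torch.randn(8, 512))
```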
arXiv Detail & Related papers (2024-06-14T08:34:20Z) - Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation [37.15828464616587]
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS).
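For reference, a minimal sketch of the classic CAM computation (Zhou et al., 2016) that such WSSS pipelines build on: weight the final conv feature maps by the classifier weights of the target class. Shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feats: torch.Tensor, fc_weight: torch.Tensor,
                         class_idx: int) -> torch.Tensor:
    """feats: (C, H, W) final conv features; fc_weight: (num_classes, C)."""
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], feats)  # weighted sum of maps
    cam = F.relu(cam)                                             # keep positive evidence
    return cam / (cam.max() + 1e-8)                               # normalize to [0, 1]

cam = class_activation_map(torch.randn(2048, 7, 7), torch.randn(20, 2048), class_idx=3)
```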
arXiv Detail & Related papers (2024-01-18T10:55:13Z) - UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [93.45067274442881]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a unified multi-granularity learning framework, named UMG-CLIP, that simultaneously empowers the model with versatile perception abilities across different levels of detail.
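A hedged sketch of what multi-granularity alignment could look like: a CLIP-style InfoNCE loss applied at the image level and again at the region level. The weighting, shapes, and names are assumptions, not UMG-CLIP's actual design.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temp                       # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def multi_granularity_loss(img_emb, img_txt_emb, region_emb, region_txt_emb):
    # img_emb: (B, D) global image features; region_emb: (R, D) pooled region features
    return 0.5 * info_nce(img_emb, img_txt_emb) + 0.5 * info_nce(region_emb, region_txt_emb)

loss = multi_granularity_loss(torch.randn(8, 512), torch.randn(8, 512),
                              torch.randn(32, 512), torch.randn(32, 512))
```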
arXiv Detail & Related papers (2024-01-12T06:35:09Z) - SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment [11.556516260190737]
Multimodal alignment between language and vision is a fundamental topic in current vision-language model research.
Contrastive Captioners (CoCa) integrates Contrastive Language-Image Pretraining (CLIP) and image captioning (IC) into a unified framework; this paper symmetrizes CoCa with attentive masking.
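A minimal sketch of such a unified objective, combining a CLIP-style contrastive term with a captioning cross-entropy term; the names, temperature, and weighting are illustrative, not CoCa's or SyCoCa's exact formulation.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(img_emb, txt_emb, caption_logits, caption_tokens, lam=1.0):
    img = F.normalize(img_emb, dim=-1)               # (B, D) pooled image embeddings
    txt = F.normalize(txt_emb, dim=-1)               # (B, D) pooled text embeddings
    sims = img @ txt.t() / 0.07                      # temperature-scaled similarities
    targets = torch.arange(img.size(0))
    contrastive = (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets)) / 2
    captioning = F.cross_entropy(                    # next-token prediction over the caption
        caption_logits.flatten(0, 1), caption_tokens.flatten())
    return contrastive + lam * captioning

loss = coca_style_loss(torch.randn(4, 512), torch.randn(4, 512),
                       torch.randn(4, 16, 30000), torch.randint(0, 30000, (4, 16)))
```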
arXiv Detail & Related papers (2024-01-04T08:42:36Z) - Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning [1.4337588659482516]
This paper exploits a feature-compounding approach to bring together high-level semantic concepts and visual information.
We propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning in which we simultaneously consolidate cross-modal features.
Our proposed SCFC outperforms various state-of-the-art image captioning models on popular metrics on the MSCOCO and Flickr30K datasets.
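A toy sketch of one cross-modal consolidation step as the summary describes it, fusing semantic-concept embeddings into visual features with cross-attention and a learned gate; the actual SCFC stacks such blocks and may differ in detail.

```python
import torch
import torch.nn as nn

class ConsolidationBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, D) region/grid features; semantic: (B, M, D) concept embeddings
        attended, _ = self.attn(visual, semantic, semantic)   # visual queries the semantics
        g = self.gate(torch.cat([visual, attended], dim=-1))  # per-feature fusion gate
        return g * visual + (1 - g) * attended                # consolidated features

block = ConsolidationBlock(512)
out = block(torch.randn(2, 49, 512), torch.randn(2, 10, 512))
```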
arXiv Detail & Related papers (2023-02-08T09:15:09Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) framework with Hybrid Counterfactual Training.
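As a baseline sketch of the composed-retrieval setup (not BOSS itself, which composes semantics bottom-up and adds counterfactual training), one can fuse the example-image and text embeddings and rank a gallery by cosine similarity:

```python
import torch
import torch.nn.functional as F

def compose(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    # Simple additive composition of reference image and modifying text.
    return F.normalize(img_emb + txt_emb, dim=-1)

def retrieve(query: torch.Tensor, gallery: torch.Tensor) -> torch.Tensor:
    # Rank gallery images by cosine similarity to the composed query.
    return (query @ F.normalize(gallery, dim=-1).t()).argsort(descending=True)

ranks = retrieve(compose(torch.randn(1, 512), torch.randn(1, 512)), torch.randn(100, 512))
```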
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
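A toy version of such a concept-token head, based only on this summary (the real CTN operates on ViT features and is trained as a multi-label classifier; the names and shapes here are assumptions): classify concepts from grid features, then hand the top-k concept embeddings to the caption decoder.

```python
import torch
import torch.nn as nn

class ConceptTokenHead(nn.Module):
    def __init__(self, dim: int, vocab: int, k: int = 20):
        super().__init__()
        self.cls = nn.Linear(dim, vocab)        # multi-label concept classifier
        self.emb = nn.Embedding(vocab, dim)     # concept token embeddings
        self.k = k

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        # grid: (B, N, D) ViT grid features -> (B, k, D) concept tokens
        scores = self.cls(grid.mean(dim=1))                  # (B, vocab) concept scores
        topk = scores.topk(self.k, dim=-1).indices           # (B, k) predicted concepts
        return self.emb(topk)                                # embed for the decoder

head = ConceptTokenHead(dim=768, vocab=1000)
concept_tokens = head(torch.randn(2, 196, 768))
```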
arXiv Detail & Related papers (2021-12-09T22:05:05Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
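A minimal sketch of text-to-pixel alignment in this spirit: score each pixel embedding against the sentence embedding and supervise with the ground-truth mask. The shapes and the loss choice are illustrative, not CRIS's exact objective.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_logits(pixel_feats: torch.Tensor, txt_emb: torch.Tensor,
                         temp: float = 0.07) -> torch.Tensor:
    # pixel_feats: (B, D, H, W); txt_emb: (B, D) referring-expression embedding
    pix = F.normalize(pixel_feats, dim=1)
    txt = F.normalize(txt_emb, dim=-1)
    return torch.einsum('bdhw,bd->bhw', pix, txt) / temp   # (B, H, W) per-pixel scores

logits = text_to_pixel_logits(torch.randn(2, 512, 30, 30), torch.randn(2, 512))
mask = torch.randint(0, 2, (2, 30, 30)).float()            # ground-truth referred region
loss = F.binary_cross_entropy_with_logits(logits, mask)
```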
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - CAGAN: Text-To-Image Generation with Combined Attention GANs [70.3497683558609]
We propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images according to textual descriptions.
The proposed CAGAN uses two attention models: word attention to draw different sub-regions conditioned on related words; and squeeze-and-excitation attention to capture non-linear interaction among channels.
With spectral normalisation to stabilise training, our proposed CAGAN improves the state of the art on the IS and FID on the CUB dataset and the FID on the more challenging COCO dataset.
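For reference, a standard squeeze-and-excitation block (Hu et al., 2018), the second attention model the summary mentions; this is the generic module, not necessarily CAGAN's exact variant.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); squeeze via global pooling, excite via channel reweighting
        scale = self.fc(x.mean(dim=(2, 3)))                  # (B, C) channel weights
        return x * scale[:, :, None, None]

out = SEBlock(64)(torch.randn(2, 64, 32, 32))
```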
arXiv Detail & Related papers (2021-04-26T15:46:40Z)