MOC-GAN: Mixing Objects and Captions to Generate Realistic Images
- URL: http://arxiv.org/abs/2106.03128v1
- Date: Sun, 6 Jun 2021 14:04:07 GMT
- Title: MOC-GAN: Mixing Objects and Captions to Generate Realistic Images
- Authors: Tao Ma, Yikang Li
- Abstract summary: We introduce a more rational setting: generating a realistic image from objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
- Score: 21.240099965546637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating images from conditional descriptions has gained increasing
interest in recent years. However, existing conditional inputs suffer from either
unstructured forms (captions) or limited information and expensive labeling
(scene graphs). For a targeted scene, the core items, the objects, are usually
definite, while their interactions are flexible and hard to define clearly.
Thus, we introduce a more rational setting: generating a realistic image from
objects and captions. Under this setting, objects explicitly define the
critical roles in the targeted images, while captions implicitly describe their
rich attributes and connections. Correspondingly, MOC-GAN is proposed to mix
the inputs of the two modalities to generate realistic images. It first infers
the implicit relations between object pairs from the captions to build a
hidden-state scene graph. A multi-layer representation containing objects,
relations, and captions is thus constructed, where the scene graph provides the
structure of the scene and the caption provides image-level guidance. A
cascaded attentive generative network is then designed to generate phrase
patches in a coarse-to-fine manner by attending to the most relevant words in
the caption. In addition, a phrase-wise DAMSM is proposed to better supervise
fine-grained phrase-patch consistency. On the COCO dataset, our method
outperforms state-of-the-art methods on both Inception Score and FID while
maintaining high visual quality. Extensive experiments demonstrate the unique
features of our proposed method.
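The relation-inference step described in the abstract can be pictured with a short sketch. The module below is a minimal, hypothetical PyTorch illustration, not the paper's released implementation: all names, dimensions, and the latent relation vocabulary are assumptions. For every ordered pair of input objects it scores a soft distribution over latent relations conditioned on a caption encoding, and the resulting edge features play the role of the hidden-state scene graph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenRelationInference(nn.Module):
    """Illustrative sketch: predict a soft latent relation for every ordered
    object pair, conditioned on a sentence-level caption encoding; the
    resulting edge features form a hidden-state scene graph."""

    def __init__(self, num_obj_classes=171, obj_dim=128, cap_dim=256,
                 num_latent_rels=64, rel_dim=128):
        super().__init__()
        self.obj_embed = nn.Embedding(num_obj_classes, obj_dim)
        self.rel_embed = nn.Embedding(num_latent_rels, rel_dim)
        self.scorer = nn.Sequential(
            nn.Linear(2 * obj_dim + cap_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_latent_rels),
        )

    def forward(self, obj_labels, caption_vec):
        # obj_labels:  (N,) integer object class ids
        # caption_vec: (cap_dim,) sentence-level caption encoding
        objs = self.obj_embed(obj_labels)                   # (N, obj_dim)
        n = objs.size(0)
        subj = objs.unsqueeze(1).expand(n, n, -1)           # subject side of each pair
        obj = objs.unsqueeze(0).expand(n, n, -1)            # object side of each pair
        cap = caption_vec.expand(n, n, -1)                  # broadcast caption to all pairs
        logits = self.scorer(torch.cat([subj, obj, cap], dim=-1))
        rel_dist = F.softmax(logits, dim=-1)                # (N, N, R) soft relation distribution
        edges = rel_dist @ self.rel_embed.weight            # (N, N, rel_dim) edge features
        return edges, rel_dist
```

Calling the module with, say, three object ids and a 256-d caption vector returns a 3x3 grid of edge embeddings, one per ordered object pair; together with the object and caption features, such a graph can serve as the multi-layer conditioning signal for the cascaded generator.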
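The phrase-wise DAMSM mentioned in the abstract extends the word-level DAMSM of AttnGAN to phrase/patch granularity. The function below is only an illustration under assumed names, shapes, and hyperparameters: each phrase attends over image-patch features, the attended similarities are pooled into a caption-image matching score, and matched pairs within a batch are trained to outscore mismatched ones.

```python
import torch
import torch.nn.functional as F

def phrase_damsm_loss(phrase_feats, patch_feats, gamma1=4.0, gamma2=5.0, gamma3=10.0):
    """Phrase-wise DAMSM-style matching loss (illustrative sketch).

    phrase_feats: (B, P, D) phrase embeddings for each caption in the batch
    patch_feats:  (B, L, D) patch features for each (generated) image
    """
    B = phrase_feats.size(0)
    scores = torch.zeros(B, B, device=phrase_feats.device)
    for i in range(B):          # caption i
        for j in range(B):      # image j
            # phrase-to-patch attention weights
            attn = torch.softmax(gamma1 * phrase_feats[i] @ patch_feats[j].t(), dim=-1)
            context = attn @ patch_feats[j]                                  # (P, D) attended patches
            rel = F.cosine_similarity(phrase_feats[i], context, dim=-1)      # (P,) phrase relevances
            # pool per-phrase relevances into one caption-image score
            scores[i, j] = torch.logsumexp(gamma2 * rel, dim=0) / gamma2
    labels = torch.arange(B, device=scores.device)
    # symmetric batch-matching loss: caption-to-image and image-to-caption
    return F.cross_entropy(gamma3 * scores, labels) + \
           F.cross_entropy(gamma3 * scores.t(), labels)
```

In training, a term of this kind would be added to the adversarial losses of the generator so that each generated patch stays consistent with the phrase it is supposed to depict.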
Related papers
- StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images [5.529078451095096]
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision.
Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics.
Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks.
arXiv Detail & Related papers (2024-06-19T17:59:40Z)
- Top-Down Framework for Weakly-supervised Grounded Image Captioning [19.00510117145054]
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
- Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z)
- Exploring Semantic Relationships for Unpaired Image Captioning [40.401322131624866]
We achieve unpaired image captioning by bridging the vision and the language domains with high-level semantic information.
We propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image.
The proposed approach boosts five strong baselines under the paired setting, where the most significant improvement in CIDEr score reaches 8%.
arXiv Detail & Related papers (2021-06-20T09:10:11Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)