Comprehensive Image Captioning via Scene Graph Decomposition
- URL: http://arxiv.org/abs/2007.11731v1
- Date: Thu, 23 Jul 2020 00:59:21 GMT
- Title: Comprehensive Image Captioning via Scene Graph Decomposition
- Authors: Yiwu Zhong, Liwei Wang, Jianshu Chen, Dong Yu, Yin Li
- Abstract summary: We address the challenging problem of image captioning by revisiting the representation of the image scene graph.
At the core of our method lies the decomposition of a scene graph into a set of sub-graphs.
We design a deep model to select important sub-graphs, and to decode each selected sub-graph into a single target sentence.
- Score: 51.660090468384375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the challenging problem of image captioning by revisiting the
representation of the image scene graph. At the core of our method lies the
decomposition of a scene graph into a set of sub-graphs, with each sub-graph
capturing a semantic component of the input image. We design a deep model to
select important sub-graphs, and to decode each selected sub-graph into a
single target sentence. By using sub-graphs, our model is able to attend to
different components of the image. Our method thus accounts for accurate,
diverse, grounded and controllable captioning at the same time. We present
extensive experiments to demonstrate the benefits of our comprehensive
captioning model. Our method establishes new state-of-the-art results in
caption diversity, grounding, and controllability, and compares favourably to
latest methods in caption quality. Our project website can be found at
http://pages.cs.wisc.edu/~yiwuzhong/Sub-GC.html.
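Below is a minimal Python sketch of this decompose-select-decode pipeline; the toy scene graph, the size-based score, and the template decoder are placeholder stand-ins for the paper's learned sub-graph selector and sentence decoder, not its actual implementation.
```python
# Minimal sketch of sub-graph decomposition, selection, and decoding.
# Scoring and decoding are placeholders for the learned components.
from itertools import combinations

# A toy scene graph as (subject, predicate, object) triples.
scene_graph = [
    ("man", "riding", "horse"),
    ("horse", "standing on", "beach"),
    ("man", "wearing", "hat"),
]

def decompose(graph, size=2):
    """Enumerate candidate sub-graphs as small sets of triples."""
    subgraphs = [frozenset([t]) for t in graph]
    subgraphs += [frozenset(c) for c in combinations(graph, size)]
    return subgraphs

def score(subgraph):
    # Placeholder for the learned sub-graph proposal network: here we
    # simply prefer larger sub-graphs so the example runs end to end.
    return len(subgraph)

def decode(subgraph):
    # Placeholder for the learned decoder: one sentence per sub-graph.
    return " and ".join(f"a {s} is {p} a {o}" for s, p, o in sorted(subgraph))

for sg in sorted(decompose(scene_graph), key=score, reverse=True)[:3]:
    print(decode(sg))
```
Decoding each selected sub-graph independently is what yields the diversity and controllability claimed above: choosing different sub-graphs produces different, grounded sentences.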
Related papers
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: being informationally sufficient, minimally redundant, and readily comprehensible by humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
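A rough sketch of this local/global integration, assuming a hypothetical captioner and a hypothetical merge step (the paper's actual aggregation is more involved):
```python
# Illustrative local/global caption pyramid. `caption` and `merge` are
# hypothetical stand-ins for a captioning model and an LLM-based
# aggregation step, not PoCa's actual API.

def caption(region):
    # Stand-in for a learned captioner applied to an image region.
    return f"a caption for {region}"

def merge(global_caption, local_captions):
    # Stand-in for aggregation that folds local detail into the global
    # caption while dropping redundant information.
    return f"{global_caption} (details: " + "; ".join(local_captions) + ")"

regions = ["top-left", "top-right", "bottom-left", "bottom-right"]
print(merge(caption("the whole image"), [caption(r) for r in regions]))
```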
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual results of similar quality to those of models specifically trained with layout conditions.
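A common way to realize training-free layout guidance is to bias cross-attention logits toward a segment's tokens inside its mask; the toy NumPy sketch below uses made-up shapes and bias values, not DenseDiffusion's exact formulation.
```python
# Toy layout-guided cross-attention modulation: boost attention of pixels
# inside a segment's mask toward that segment's text tokens, suppress it
# elsewhere. Shapes and bias are illustrative only.
import numpy as np

num_pixels, num_tokens = 16, 8            # flattened 4x4 latent, 8 tokens
logits = np.random.randn(num_pixels, num_tokens)

mask = np.zeros(num_pixels, dtype=bool)   # pixels covered by one segment
mask[:8] = True
segment_tokens = [2, 3]                   # tokens of that segment's phrase

bias = 2.0
for t in segment_tokens:
    logits[mask, t] += bias               # pull masked pixels toward phrase
    logits[~mask, t] -= bias              # push other pixels away from it

attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
print(attn.shape)
```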
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach that addresses these challenges by showing how captions generated by different SoTA models can be effectively fused.
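A sketch of the rank-then-fuse idea, with a hypothetical relevance scorer and a hypothetical LLM fusion call standing in for the paper's components:
```python
# Rank candidate captions, then fuse the best ones. The scorer and the
# "LLM" are toy stand-ins, not the paper's actual models.
candidates = [
    "a dog on a couch",
    "a brown dog sleeping on a grey couch near a window",
    "an animal indoors",
]

def relevance(caption):
    # Stand-in scorer: caption length as a crude proxy for descriptiveness.
    return len(caption.split())

def llm_fuse(captions):
    # Stand-in for a prompted LLM that merges captions into one sentence.
    return "Fused: " + "; ".join(captions)

ranked = sorted(candidates, key=relevance, reverse=True)
print(llm_fuse(ranked[:2]))
```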
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning [6.101765622702223]
The Noise-aware Captioning (NoC) framework learns rich knowledge from the whole web-crawled dataset while being less affected by its noise.
This is achieved by the proposed alignment-level-controllable captioner, which is learned using alignment levels of the image-text pairs as a control signal.
An in-depth analysis shows the effectiveness of our framework in handling noise.
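As a rough illustration, alignment-level control can be sketched as scoring each image-text pair, bucketing the score into a discrete control token for training, and conditioning on the highest level at inference; the embeddings and thresholds below are toy values, not NoC's actual alignment model.
```python
# Bucket an image-text alignment score into a discrete control token.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_token(img_emb, txt_emb, thresholds=(0.3, 0.6)):
    s = cosine(img_emb, txt_emb)
    level = sum(s > t for t in thresholds)  # 0 = noisy ... 2 = well aligned
    return f"<align_{level}>"               # prepended to the caption

rng = np.random.default_rng(0)
img, txt = rng.normal(size=64), rng.normal(size=64)
print(alignment_token(img, txt))
```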
arXiv Detail & Related papers (2022-12-27T17:33:40Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
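A toy sketch of one plausible spatio-textual representation: each segment's pixels carry an embedding of its local description, alongside the global prompt; the hash-based embedding and the tensor layout are assumptions, not SpaText's implementation.
```python
# Build an H x W x D conditioning tensor from per-segment text prompts.
import numpy as np

H, W, D = 8, 8, 4

def embed(text):
    # Toy deterministic "embedding"; a real system would use a text
    # encoder such as CLIP here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=D)

spatio_textual = np.zeros((H, W, D))
segments = {"a red boat": (slice(0, 4), slice(0, 8)),
            "calm blue water": (slice(4, 8), slice(0, 8))}
for text, region in segments.items():
    spatio_textual[region] = embed(text)

print(spatio_textual.shape)  # conditioning input to the diffusion model
```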
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts of any precision levels.
The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges that come with this new setup.
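The precision-level idea can be sketched by coarsening a shape according to its level, from no shape information (plain T2I) through a bounding box to the exact mask (S2I); the three-level scheme below is an illustrative simplification.
```python
# Coarsen a binary mask according to a precision level.
import numpy as np

def coarsen(mask, level):
    if level == 0:                        # no shape information (T2I)
        return np.ones_like(mask)
    if level == 1:                        # coarse: bounding box of the mask
        ys, xs = np.nonzero(mask)
        box = np.zeros_like(mask)
        box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
        return box
    return mask                           # precise: the exact mask (S2I)

mask = np.zeros((8, 8), dtype=int)
mask[2:5, 3:6] = 1
mask[5, 4] = 1                            # make the shape non-rectangular
for level in range(3):
    print(level, int(coarsen(mask, level).sum()))
```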
arXiv Detail & Related papers (2022-11-21T18:59:05Z)
- RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network [19.017377597937617]
We study the compositional learning of images and texts for image retrieval.
We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
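A minimal sketch of a residual composition block in this spirit, with random weights and the GCN part omitted; the exact architecture here is an assumption, not RTIC's implementation.
```python
# Residual composition: image feature plus a learned residual computed
# from the concatenated image and text features.
import numpy as np

rng = np.random.default_rng(0)
d = 16
W1 = rng.normal(size=(d, 2 * d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1

def compose(img_feat, txt_feat):
    h = np.maximum(W1 @ np.concatenate([img_feat, txt_feat]), 0)  # ReLU MLP
    return img_feat + W2 @ h              # residual connection

img, txt = rng.normal(size=d), rng.normal(size=d)
print(compose(img, txt).shape)            # composed query for retrieval
```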
arXiv Detail & Related papers (2021-04-07T09:41:52Z)
- Robust Image Captioning [3.20603058999901]
In this study, we leverage object relations using an adversarial robust cut algorithm.
Our experiments demonstrate the promising performance of the proposed method for image captioning.
arXiv Detail & Related papers (2020-12-06T00:33:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.