Say As You Wish: Fine-grained Control of Image Caption Generation with
Abstract Scene Graphs
- URL: http://arxiv.org/abs/2003.00387v1
- Date: Sun, 1 Mar 2020 03:34:07 GMT
- Title: Say As You Wish: Fine-grained Control of Image Caption Generation with
Abstract Scene Graphs
- Authors: Shizhe Chen, Qin Jin, Peng Wang, Qi Wu
- Abstract summary: We propose the Abstract Scene Graph structure to represent user intention at a fine-grained level.
From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph.
Our model achieves better controllability conditioned on ASGs than carefully designed baselines on both VisualGenome and MSCOCO datasets.
- Score: 74.88118535585903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans are able to describe image contents with coarse to fine details as
they wish. However, most image captioning models are intention-agnostic and
cannot proactively generate diverse descriptions according to different user
intentions. In this work, we propose the Abstract Scene Graph (ASG) structure
to represent user intention at a fine-grained level and control what and how
detailed the generated description should be. The ASG is a directed graph
consisting of three types of abstract nodes (object, attribute,
relationship) grounded in the image without any concrete semantic labels. Thus
it is easy to obtain either manually or automatically. From the ASG, we propose
a novel ASG2Caption model, which is able to recognise user intentions and
semantics in the graph, and therefore generate desired captions according to
the graph structure. Our model achieves better controllability conditioned on
ASGs than carefully designed baselines on both VisualGenome and MSCOCO
datasets. It also significantly improves the caption diversity via
automatically sampling diverse ASGs as control signals.
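As a rough illustration of the structure described above, the following is a minimal sketch (not code from the paper; all class and field names are illustrative) of a directed graph whose nodes carry only an abstract type and an image grounding, with no concrete semantic labels:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class NodeType(Enum):
    OBJECT = "object"
    ATTRIBUTE = "attribute"
    RELATIONSHIP = "relationship"


@dataclass
class ASGNode:
    node_id: int
    node_type: NodeType
    # Grounding in the image (e.g. a bounding box). No semantic label is stored:
    # an ASG node only says "an object here", "some attribute of it", or
    # "a relation between two objects".
    box: Tuple[float, float, float, float]


@dataclass
class AbstractSceneGraph:
    nodes: List[ASGNode] = field(default_factory=list)
    # Directed edges, e.g. object -> attribute or object -> relationship -> object.
    edges: List[Tuple[int, int]] = field(default_factory=list)

    def add_node(self, node: ASGNode) -> None:
        self.nodes.append(node)

    def add_edge(self, src_id: int, dst_id: int) -> None:
        self.edges.append((src_id, dst_id))


# Example intention: "describe this object, one of its attributes, and its
# relation to a second object".
asg = AbstractSceneGraph()
asg.add_node(ASGNode(0, NodeType.OBJECT, (0.10, 0.20, 0.40, 0.60)))
asg.add_node(ASGNode(1, NodeType.ATTRIBUTE, (0.10, 0.20, 0.40, 0.60)))
asg.add_node(ASGNode(2, NodeType.RELATIONSHIP, (0.00, 0.00, 1.00, 1.00)))
asg.add_node(ASGNode(3, NodeType.OBJECT, (0.50, 0.30, 0.90, 0.80)))
asg.add_edge(0, 1)  # object -> its attribute
asg.add_edge(0, 2)  # object -> relationship
asg.add_edge(2, 3)  # relationship -> second object
```

Growing or shrinking such a graph is the lever by which a user (or an automatic sampler) controls how detailed the generated caption should be.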
Related papers
- Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions [53.069446715005924]
Graph-based captioning (GBC) describes an image using a labelled graph structure.
Nodes in GBC are created, in a first stage, using object detection and dense captioning tools.
We show that using GBC nodes' annotations results in a significant performance boost for downstream models.
arXiv Detail & Related papers (2024-07-09T09:55:04Z) - FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph
Parsing [66.70054075041487]
Existing parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z) - Towards Few-shot Entity Recognition in Document Images: A Graph Neural
Network Approach Robust to Image Manipulation [38.09501948846373]
We introduce the topological adjacency relationship among the tokens, emphasizing their relative position information.
We incorporate these graphs into the pre-trained language model by adding graph neural network layers on top of the language model embeddings.
Experiments on two benchmark datasets show that LAGER significantly outperforms strong baselines under different few-shot settings.
arXiv Detail & Related papers (2023-05-24T07:34:33Z) - Diffusion-Based Scene Graph to Image Generation with Masked Contrastive
Pre-Training [112.94542676251133]
We propose to learn scene graph embeddings by directly optimizing their alignment with images.
Specifically, we pre-train an encoder to extract both global and local information from scene graphs.
The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections.
arXiv Detail & Related papers (2022-11-21T01:11:19Z) - Image Semantic Relation Generation [0.76146285961466]
Scene graphs can distil complex image information and correct the bias of visual models using semantic-level relations.
In this work, we introduce image semantic relation generation (ISRG), a simple but effective image-to-text model.
arXiv Detail & Related papers (2022-10-19T16:15:19Z) - Target-oriented Sentiment Classification with Sequential Cross-modal
Semantic Graph [27.77392307623526]
Multi-modal aspect-based sentiment classification (MABSC) is the task of classifying the sentiment of a target entity mentioned in a sentence and an image.
Previous methods failed to account for the fine-grained semantic association between the image and the text.
We propose a new approach called SeqCSG, which enhances the encoder-decoder sentiment classification framework using sequential cross-modal semantic graphs.
arXiv Detail & Related papers (2022-08-19T16:04:29Z) - Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning the scene graph; a toy sketch of this matching step appears after this list.
arXiv Detail & Related papers (2021-09-06T03:38:52Z) - Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using
Scene Graphs [85.54212143154986]
Controllable scene synthesis consists of generating 3D information that satisfies underlying specifications.
Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges).
We propose the first work that directly generates shapes from a scene graph in an end-to-end manner.
arXiv Detail & Related papers (2021-08-19T17:59:07Z) - MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting, generating a realistic image from the objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
MOC-GAN is proposed to mix the inputs of the two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z) - SG2Caps: Revisiting Scene Graphs for Image Captioning [37.58310822924814]
We propose a framework, SG2Caps, that utilizes only scene graph labels for competitive image captioning performance.
Our framework outperforms existing scene-graph-only captioning models by a large margin (CIDEr score of 110 vs. 71), indicating that scene graphs are a promising representation for image captioning.
arXiv Detail & Related papers (2021-02-09T18:00:53Z)
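The pseudo-labelling step mentioned in "Learning to Generate Scene Graph from Natural Language Supervision" above can be illustrated with a minimal sketch; the helper below is hypothetical and uses naive string matching, whereas that paper's actual matching strategy may differ:

```python
from typing import Dict, List, Tuple


def match_regions_to_concepts(detections: List[Dict],
                              caption_concepts: List[str]) -> List[Tuple[int, str]]:
    """Assign each caption concept to a detected region whose class label matches it.

    `detections` holds {"region_id": int, "label": str} entries from an
    off-the-shelf object detector; `caption_concepts` are noun phrases parsed
    from the caption. Returns (region_id, concept) pseudo labels; concepts
    with no matching region are dropped.
    """
    pseudo_labels = []
    for concept in caption_concepts:
        for det in detections:
            if det["label"].lower() in concept.lower():
                pseudo_labels.append((det["region_id"], concept))
                break
    return pseudo_labels


# Example: the detector finds a "dog" and a "frisbee"; the caption mentions both.
dets = [{"region_id": 0, "label": "dog"}, {"region_id": 1, "label": "frisbee"}]
print(match_regions_to_concepts(dets, ["a brown dog", "a red frisbee"]))
# [(0, 'a brown dog'), (1, 'a red frisbee')]
```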