Topic Scene Graph Generation by Attention Distillation from Caption
- URL: http://arxiv.org/abs/2110.05731v1
- Date: Tue, 12 Oct 2021 04:26:12 GMT
- Title: Topic Scene Graph Generation by Attention Distillation from Caption
- Authors: W. Wang, R. Wang, X. Chen
- Abstract summary: A scene graph is not as practical as expected unless it can reduce trivial content and noise.
We let the scene graph borrow the image caption's ability to outline the gist, so that it can act as a specialist while remaining all-around.
Experiments show that attention distillation brings significant improvements in mining important relationships without strong supervision.
- Score: 1.181694273002388
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: If an image tells a story, the image caption is the briefest narrator.
Generally, a scene graph prefers to be an omniscient generalist, while the
image caption is more willing to be a specialist that outlines the gist. Many
previous studies have found that a scene graph is not as practical as expected
unless it can reduce trivial content and noise. In this respect, the image
caption is a good tutor. To this end, we let the scene graph borrow the image
caption's ability to outline the gist, so that it can act as a specialist while
remaining all-around, resulting in the so-called Topic Scene Graph. What an
image caption pays attention to is distilled and passed to the scene graph for
estimating the importance of objects, relationships, and events. Specifically,
during caption generation, the attention over individual objects at each time
step is collected, pooled, and assembled into attention over relationships,
which serves as weak supervision for regularizing the estimated importance
scores of relationships. In addition, as this attention distillation process
provides an opportunity to combine the generation of the image caption and the
scene graph, we further transform the scene graph into linguistic form with
rich and free-form expressions by sharing a single generation model with the
image caption. Experiments show that attention distillation brings significant
improvements in mining important relationships without strong supervision, and
the topic scene graph shows great potential in subsequent applications.
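The attention-distillation step described in the abstract can be pictured with a short sketch: per-object decoder attention is collected over caption time steps, pooled, assembled into pairwise relationship attention, and used as weak supervision for predicted relationship importance scores. The code below is a minimal illustration in PyTorch; the max-over-time pooling, geometric-mean pairwise assembly, and KL-divergence regularizer are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of caption-attention distillation for relationship importance.
# Pooling, pairwise assembly, and the loss are illustrative assumptions.
import torch
import torch.nn.functional as F

def relation_attention_from_caption(obj_attn: torch.Tensor) -> torch.Tensor:
    """obj_attn: (T, N) caption-decoder attention over N objects at T time steps."""
    per_obj = obj_attn.max(dim=0).values                    # pool over time -> (N,)
    # assemble pairwise "relationship attention" from subject and object attention
    return torch.sqrt(per_obj[:, None] * per_obj[None, :])  # (N, N)

def importance_regularizer(pred_scores: torch.Tensor,
                           rel_attn: torch.Tensor,
                           pair_idx: torch.Tensor) -> torch.Tensor:
    """Weakly supervise predicted relationship importance with caption attention.

    pred_scores: (M,) importance logits for M candidate relationships
    rel_attn:    (N, N) assembled attention from the caption decoder
    pair_idx:    (M, 2) subject/object indices of each candidate relationship
    """
    target = rel_attn[pair_idx[:, 0], pair_idx[:, 1]]        # (M,)
    target = target / (target.sum() + 1e-8)                  # normalize to a distribution
    log_pred = F.log_softmax(pred_scores, dim=0)
    return F.kl_div(log_pred, target, reduction="sum")

# usage sketch with random stand-ins for decoder attention and candidate relations
T, N, M = 12, 6, 9
obj_attn = torch.rand(T, N).softmax(dim=-1)
pair_idx = torch.randint(0, N, (M, 2))
pred_scores = torch.randn(M, requires_grad=True)
loss = importance_regularizer(pred_scores,
                              relation_attention_from_caption(obj_attn),
                              pair_idx)
loss.backward()
```

Here `pair_idx` and `pred_scores` stand in for the scene graph branch's candidate relationships and their importance logits; only the importance estimator receives gradients from this regularizer.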
Related papers
- GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives [69.36723767339001]
We propose a novel framework named GPT4SGG to obtain more accurate and comprehensive scene graph signals.
We show GPT4SGG significantly improves the performance of SGG models trained on image-caption data.
arXiv Detail & Related papers (2023-12-07T14:11:00Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Unconditional Scene Graph Generation [72.53624470737712]
We develop a deep auto-regressive model called SceneGraphGen which can learn the probability distribution over labelled and directed graphs.
We show that the scene graphs generated by SceneGraphGen are diverse and follow the semantic patterns of real-world scenes.
arXiv Detail & Related papers (2021-08-12T17:57:16Z)
- Linguistic Structures as Weak Supervision for Visual Scene Graph Generation [39.918783911894245]
We show how linguistic structures in captions can benefit scene graph generation.
Our method captures the information provided in captions about relations between individual triplets, and context for subjects and objects.
Given the large and diverse sources of multimodal data on the web, linguistic supervision is more scalable than crowdsourced triplets.
arXiv Detail & Related papers (2021-05-28T17:20:27Z)
- A Comprehensive Survey of Scene Graphs: Generation and Application [42.07469181785126]
A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene.
To date, no systematic survey of scene graphs exists.
arXiv Detail & Related papers (2021-03-17T04:24:20Z)
- SG2Caps: Revisiting Scene Graphs for Image Captioning [37.58310822924814]
We propose a framework, SG2Caps, that utilizes only the scene graph labels for competitive image captioning performance.
Our framework outperforms existing scene-graph-only captioning models by a large margin (CIDEr score of 110 vs. 71), indicating that scene graphs are a promising representation for image captioning.
arXiv Detail & Related papers (2021-02-09T18:00:53Z)
- Are scene graphs good enough to improve Image Captioning? [19.36188161855731]
We investigate the use of scene graphs in image captioning.
We find no significant difference between models that use scene graph features and models that only use object detection features.
Although the quality of predicted scene graphs is generally very low, using high-quality scene graphs yields gains of up to 3.3 CIDEr.
arXiv Detail & Related papers (2020-09-25T16:09:08Z)
- Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation [98.34909905511061]
We argue that a desirable scene graph should be hierarchically constructed, and introduce a new scheme for modeling scene graphs based on a Hierarchical Entity Tree (HET).
To generate a scene graph based on HET, we parse HET with a Hybrid Long Short-Term Memory (Hybrid-LSTM) which specifically encodes hierarchy and sibling context.
To further prioritize key relations in the scene graph, we devise a Relation Ranking Module (RRM) to dynamically adjust their rankings; a minimal sketch of such a ranking step follows this list.
arXiv Detail & Related papers (2020-07-17T05:12:13Z)
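Relation ranking of the kind mentioned in the last entry, like the relationship importance scores in the main paper, reduces to scoring candidate relations and training the scores so that salient relations come first. The sketch below is a hypothetical illustration only: the MLP scorer, the feature layout, and the margin ranking loss are assumptions, not the RRM's actual design.

```python
# Hypothetical sketch of a relation-ranking step: score candidate relations with a
# small MLP and train with a margin ranking loss so salient relations outrank others.
import torch
import torch.nn as nn

class RelationRanker(nn.Module):
    """Scores candidate relations; a higher score means a higher rank."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(feat_dim, 128),
                                    nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, rel_feats: torch.Tensor) -> torch.Tensor:
        # rel_feats: (M, feat_dim), one feature vector per candidate relation
        return self.scorer(rel_feats).squeeze(-1)    # (M,) importance scores

def margin_ranking_loss(scores: torch.Tensor, salient: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    """Push every salient relation's score above every non-salient one by a margin."""
    pos, neg = scores[salient], scores[~salient]
    gaps = margin - (pos[:, None] - neg[None, :])    # pairwise margin violations
    return torch.clamp(gaps, min=0).mean()

# usage sketch: 10 candidate relations, the first 3 marked salient
ranker = RelationRanker()
rel_feats = torch.randn(10, 512)
salient = torch.zeros(10, dtype=torch.bool)
salient[:3] = True
scores = ranker(rel_feats)
loss = margin_ranking_loss(scores, salient)
loss.backward()
ranked = scores.argsort(descending=True)             # re-ranked relation order
```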
This list is automatically generated from the titles and abstracts of the papers on this site.