Disentangled Motif-aware Graph Learning for Phrase Grounding
- URL: http://arxiv.org/abs/2104.06008v1
- Date: Tue, 13 Apr 2021 08:20:07 GMT
- Title: Disentangled Motif-aware Graph Learning for Phrase Grounding
- Authors: Zongshen Mu, Siliang Tang, Jie Tan, Qiang Yu, Yueting Zhuang
- Abstract summary: We propose a novel graph learning framework for phrase grounding in images.
We devise the disentangled graph network to integrate the motif-aware contextual information into representations.
Our model achieves state-of-the-art performance on Flickr30K Entities and ReferIt Game benchmarks.
- Score: 48.64279161780489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel graph learning framework for phrase
grounding in images. Evolving from sequential to dense graph models, existing
works capture coarse-grained context but fail to distinguish the diversity of
context among phrases and image regions. In contrast, we pay special attention
to the different motifs implied in the context of the scene graph and devise a
disentangled graph network to integrate motif-aware contextual information into
the representations. In addition, we adopt interventional strategies at the
feature and structure levels to consolidate and generalize the representations.
Finally, a cross-modal attention network is used to fuse intra-modal features,
computing the similarity between each phrase and the candidate regions to
select the best-grounded one. We validate the effectiveness of the disentangled
and interventional graph network (DIGN) through a series of ablation studies,
and our model achieves state-of-the-art performance on the Flickr30K Entities
and ReferIt Game benchmarks.
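A minimal sketch of the final grounding step described above: phrases attend over region features (cross-modal attention), the fused features are scored against each region, and the highest-similarity region is selected per phrase. The dot-product attention, residual fusion, and cosine scoring below are illustrative assumptions, not the authors' exact DIGN implementation; `ground_phrases` and all dimensions are hypothetical.

```python
# Sketch of phrase-region grounding via cross-modal attention (assumptions,
# not the paper's DIGN architecture).
import torch
import torch.nn.functional as F

def ground_phrases(phrase_feats: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
    """phrase_feats: (P, d) phrase embeddings; region_feats: (R, d) region embeddings.
    Returns the index of the best-grounded region for each phrase."""
    d = phrase_feats.size(-1)
    # Cross-modal attention: each phrase attends over all regions
    # (scaled dot-product form is an assumption).
    attn = torch.softmax(phrase_feats @ region_feats.t() / d ** 0.5, dim=-1)  # (P, R)
    attended = attn @ region_feats          # (P, d) region context per phrase
    fused = phrase_feats + attended         # simple residual fusion (assumption)
    # Cosine similarity between fused phrase features and every region.
    sim = F.normalize(fused, dim=-1) @ F.normalize(region_feats, dim=-1).t()  # (P, R)
    return sim.argmax(dim=-1)               # best-grounded region per phrase

# Toy usage: 3 phrases, 5 candidate regions, 256-d features.
phrases = torch.randn(3, 256)
regions = torch.randn(5, 256)
print(ground_phrases(phrases, regions))     # e.g. tensor([4, 1, 2])
```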
Related papers
- Two Stream Scene Understanding on Graph Embedding [4.78180589767256]
The paper presents a novel two-stream network architecture for enhancing scene understanding in computer vision.
The graph feature stream network comprises a segmentation structure, scene graph generation, and a graph representation module.
Experiments conducted on the ADE20K dataset demonstrate the effectiveness of the proposed two-stream network in improving image classification accuracy.
arXiv Detail & Related papers (2023-11-12T05:57:56Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Group Contrastive Self-Supervised Learning on Graphs [101.45974132613293]
We study self-supervised learning on graphs using contrastive methods.
We argue that contrasting graphs in multiple subspaces enables graph encoders to capture more abundant characteristics.
arXiv Detail & Related papers (2021-07-20T22:09:21Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)