Transforming Visual Scene Graphs to Image Captions
- URL: http://arxiv.org/abs/2305.02177v4
- Date: Mon, 11 Dec 2023 09:05:00 GMT
- Title: Transforming Visual Scene Graphs to Image Captions
- Authors: Xu Yang, Jiawei Peng, Zihua Wang, Haiyang Xu, Qinghao Ye, Chenliang
Li, Songfang Huang, Fei Huang, Zhangzikang Li and Yu Zhang
- Abstract summary: We propose to Transform Scene Graphs (TSG) into more descriptive captions.
In TSG, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) that embeds scene graphs.
A Mixture-of-Experts (MoE) decoder, in which each expert is built on MHA, then discriminates among the graph embeddings to generate different kinds of words.
- Score: 69.13204024990672
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose to Transform Scene Graphs (TSG) into more descriptive
captions. In TSG, we apply multi-head attention (MHA) to design the Graph
Neural Network (GNN) for embedding scene graphs. After embedding, different
graph embeddings contain diverse, specific knowledge for generating words of
different parts-of-speech, e.g., object/attribute embeddings are good for
generating nouns/adjectives. Motivated by this, we design a
Mixture-of-Experts (MoE)-based decoder, where each expert is built on MHA,
for discriminating among the graph embeddings to generate different kinds of
words. Since both the encoder and decoder are built on MHA, we obtain a
homogeneous encoder-decoder, unlike previous heterogeneous ones that usually
apply a fully-connected-layer-based GNN and an LSTM-based decoder. The
homogeneous architecture lets us unify the training configuration of the
whole model instead of specifying different training strategies for diverse
sub-networks as in the heterogeneous pipeline, which eases training.
Extensive experiments on the MS-COCO captioning benchmark validate the
effectiveness of our TSG. The code is at: https://github.com/GaryJiajia/TSG.
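To make the architecture concrete, here is a minimal PyTorch sketch of the two components the abstract describes: an MHA-based GNN encoder for scene graphs, and a decoder step built as a mixture of MHA experts. The module names, dimensions, and soft gating below are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the abstract's two components: an MHA-based GNN encoder
# for scene graphs and a mixture of MHA experts for decoding.
# All module names, sizes, and the soft gating are illustrative assumptions.
import torch
import torch.nn as nn

class MHAGraphEncoder(nn.Module):
    """Embeds scene-graph nodes (objects/attributes/relations) with MHA.
    The graph structure enters as an attention mask restricted to each
    node's neighborhood, so message passing is just masked attention."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, nodes, adj_mask):
        # nodes: (B, N, d); adj_mask: (B * n_heads, N, N), True = blocked
        out, _ = self.attn(nodes, nodes, nodes, attn_mask=adj_mask)
        return self.norm(nodes + out)

class MoEMHADecoderStep(nn.Module):
    """One decoder layer: each expert is an MHA attending to one kind of
    graph embedding (e.g., objects vs. attributes); a soft gate mixes the
    experts so different parts-of-speech can favor different experts."""
    def __init__(self, d_model=512, n_heads=8, n_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, tgt, graph_embeds):
        # tgt: (B, T, d) partial caption; graph_embeds: n_experts tensors (B, N, d)
        outs = [attn(tgt, g, g)[0] for attn, g in zip(self.experts, graph_embeds)]
        weights = torch.softmax(self.gate(tgt), dim=-1)          # (B, T, E)
        return sum(w.unsqueeze(-1) * o                           # (B, T, d)
                   for w, o in zip(weights.unbind(-1), outs))
```

Because every sub-network here is plain multi-head attention, a single optimizer and schedule can train the whole model, which is the homogeneity benefit the abstract claims over FC-GNN-plus-LSTM pipelines.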
Related papers
- Learning Graph Quantized Tokenizers for Transformers [28.79505338383552]
Graph Transformers (GTs) have emerged as leading models in deep learning, outperforming Graph Neural Networks (GNNs) in various graph learning tasks.
We introduce GQT (Graph Quantized Tokenizer), which decouples tokenizer training from Transformer training by leveraging graph self-supervised learning.
By combining the GQT with token modulation, a Transformer encoder achieves state-of-the-art performance on 16 out of 18 benchmarks, including large-scale homophilic and heterophilic datasets.
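The summary names a quantized tokenizer without giving details; the sketch below is one plausible, hypothetical reading: a VQ-style nearest-codebook lookup that turns continuous node embeddings into discrete tokens a Transformer can consume. Nothing here is taken from the GQT paper itself.

```python
# Hypothetical VQ-style graph tokenizer: node embeddings are snapped to
# their nearest codebook entry, yielding discrete token ids for a
# Transformer. Codebook size and dimensions are arbitrary assumptions.
import torch

def quantize_nodes(node_embeds: torch.Tensor, codebook: torch.Tensor):
    """node_embeds: (N, d); codebook: (K, d) -> (ids (N,), tokens (N, d))."""
    dists = torch.cdist(node_embeds, codebook)   # (N, K) pairwise L2 distances
    token_ids = dists.argmin(dim=1)              # nearest codebook entry per node
    return token_ids, codebook[token_ids]

node_embeds = torch.randn(100, 64)   # e.g., output of a self-supervised GNN
codebook = torch.randn(512, 64)      # learned codebook of 512 graph tokens
ids, tokens = quantize_nodes(node_embeds, codebook)
```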
arXiv Detail & Related papers (2024-10-17T17:38:24Z)
- A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
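Sampling node contexts with random walks, as the GSPT summary describes, can be sketched in a few lines; the adjacency-list format and walk length below are assumptions.

```python
# Illustrative random-walk context sampler: the visited nodes serve as the
# "context sequence" fed to a Transformer. Format and length are assumptions.
import random

def random_walk(adj: dict[int, list[int]], start: int, length: int) -> list[int]:
    walk = [start]
    for _ in range(length - 1):
        neighbors = adj[walk[-1]]
        if not neighbors:          # dead end: stop early
            break
        walk.append(random.choice(neighbors))
    return walk

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(random_walk(adj, start=0, length=5))
```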
arXiv Detail & Related papers (2024-06-19T22:30:08Z)
- UniG-Encoder: A Universal Feature Encoder for Graph and Hypergraph Node Classification [6.977634174845066]
A universal feature encoder for both graph and hypergraph representation learning is designed, called UniG-Encoder.
The architecture starts with a forward transformation of the topological relationships of connected nodes into edge or hyperedge features.
The encoded node embeddings are then derived from the reversed transformation, described by the transpose of the projection matrix.
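The forward/reversed transformation pair described above maps naturally onto an incidence matrix and its transpose; the sketch below is a minimal reading of that pattern, with the lack of normalization and the middle MLP as assumptions.

```python
# Sketch of the projection/transpose pattern: a (hyper)edge-node incidence
# matrix P maps node features to edge features, and P^T maps processed edge
# features back to nodes. Normalization and the MLP are assumptions.
import torch
import torch.nn as nn

n_nodes, n_edges, d = 6, 4, 16
X = torch.randn(n_nodes, d)          # node features
P = torch.zeros(n_edges, n_nodes)    # incidence: P[e, v] = 1 if v in edge e
P[0, [0, 1]] = 1.0                   # edge 0 connects nodes 0 and 1
P[1, [1, 2, 3]] = 1.0                # a hyperedge may join more than 2 nodes
P[2, [3, 4]] = 1.0
P[3, [4, 5]] = 1.0

mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
edge_feats = P @ X                   # forward transform: nodes -> (hyper)edges
node_embeds = P.T @ mlp(edge_feats)  # reversed transform via the transpose
```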
arXiv Detail & Related papers (2023-08-03T09:32:50Z)
- Neural Machine Translation with Dynamic Graph Convolutional Decoder [32.462919670070654]
We propose an end-to-end translation architecture from (graph & sequence) structural inputs to (graph & sequence) outputs, where the target translation and its corresponding syntactic graph are jointly modeled and generated.
We conduct extensive experiments on five widely acknowledged translation benchmarks, verifying that our proposal achieves consistent improvements over baselines and other syntax-aware variants.
arXiv Detail & Related papers (2023-05-28T11:58:07Z)
- Training Free Graph Neural Networks for Graph Matching [103.45755859119035]
TFGM is a framework that boosts the performance of Graph Neural Network (GNN)-based graph matching without training.
Applying TFGM on various GNNs shows promising improvements over baselines.
arXiv Detail & Related papers (2022-01-14T09:04:46Z)
- MGAE: Masked Autoencoders for Self-Supervised Learning on Graphs [55.66953093401889]
We propose a masked graph autoencoder (MGAE) framework to perform effective learning on graph-structured data.
Taking insights from self-supervised learning, we randomly mask a large proportion of edges and try to reconstruct these missing edges during training.
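The edge-masking step the summary describes is easy to sketch; the mask ratio and the COO edge format below are assumptions.

```python
# Minimal edge-masking sketch: randomly hide a large fraction of edges, keep
# the rest as encoder input, and treat the hidden ones as reconstruction
# targets. The 70% mask ratio is an assumption.
import torch

def mask_edges(edge_index: torch.Tensor, mask_ratio: float = 0.7):
    """edge_index: (2, E) COO edges -> (visible (2, E_v), masked (2, E_m))."""
    num_edges = edge_index.size(1)
    perm = torch.randperm(num_edges)
    n_masked = int(mask_ratio * num_edges)
    masked, visible = perm[:n_masked], perm[n_masked:]
    return edge_index[:, visible], edge_index[:, masked]

edge_index = torch.randint(0, 100, (2, 500))   # toy graph with 500 edges
visible_edges, target_edges = mask_edges(edge_index)
# An encoder sees only `visible_edges`; a decoder then scores the node pairs
# in `target_edges` against negatives, e.g., with binary cross-entropy.
```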
arXiv Detail & Related papers (2022-01-07T16:48:07Z)
- GN-Transformer: Fusing Sequence and Graph Representation for Improved Code Summarization [0.0]
We propose a novel method, GN-Transformer, to learn end-to-end on a fused sequence and graph modality.
The proposed methods achieve state-of-the-art performance in two code summarization datasets and across three automatic code summarization metrics.
arXiv Detail & Related papers (2021-11-17T02:51:37Z)
- Empirical Analysis of Image Caption Generation using Deep Learning [0.0]
We have implemented and experimented with various flavors of multi-modal image captioning networks.
The goal is to analyze the performance of each approach using various evaluation metrics.
arXiv Detail & Related papers (2021-05-14T05:38:13Z)
- Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification [110.52328716130022]
Video-based person re-identification (re-ID) is an important research topic in computer vision.
We propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to achieve better representational capabilities.
MGH achieves 90.0% top-1 accuracy on MARS, outperforming state-of-the-art schemes.
arXiv Detail & Related papers (2021-04-30T11:20:02Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
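One minimal reading of "images as graph nodes" is an affinity graph over a group's image features with a propagation step, sketched below; the cosine affinity and single linear update are assumptions, not the paper's exact GNN.

```python
# Illustrative group-as-graph step: each image in a group is a node, edges
# carry pairwise feature affinities, and one propagation step mixes
# information across images. Affinity choice is an assumption.
import torch
import torch.nn.functional as F

def group_message_pass(img_feats: torch.Tensor) -> torch.Tensor:
    """img_feats: (G, d), one feature vector per image in the group."""
    unit = F.normalize(img_feats, dim=1)
    weights = torch.softmax(unit @ unit.T, dim=1)  # (G, G) affinity edges
    return weights @ img_feats                     # aggregate across the group

group = torch.randn(8, 256)                        # a group of 8 image features
updated = group_message_pass(group)                # group-aware representations
```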
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.