Grounded Situation Recognition with Transformers
- URL: http://arxiv.org/abs/2111.10135v1
- Date: Fri, 19 Nov 2021 10:10:03 GMT
- Title: Grounded Situation Recognition with Transformers
- Authors: Junhyeong Cho, Youngseok Yoon, Hyeonjun Lee, Suha Kwak
- Abstract summary: Grounded Situation Recognition (GSR) is the task of not only classifying a salient action (verb), but also predicting entities (nouns) associated with semantic roles and their locations in the given image.
Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture.
- Score: 11.202435939275675
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Grounded Situation Recognition (GSR) is the task of not only
classifying a salient action (verb), but also predicting entities (nouns)
associated with semantic roles and their locations in the given image. Inspired by the
remarkable success of Transformers in vision tasks, we propose a GSR model
based on a Transformer encoder-decoder architecture. The attention mechanism of
our model enables accurate verb classification by effectively capturing
high-level semantic features of an image, and allows the model to flexibly
handle the complicated and image-dependent relations between entities for improved noun
classification and localization. Our model is the first Transformer
architecture for GSR, and achieves the state of the art in every evaluation
metric on the SWiG benchmark. Our code is available at
https://github.com/jhcho99/gsrtr .
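The abstract credits the attention mechanism for both verb classification and modeling relations between entities. As a minimal illustration only (not the authors' implementation, which lives in the linked repository), the following pure-Python sketch shows the scaled dot-product attention at the core of any such Transformer encoder-decoder; the toy vectors and dimensions are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each query attends to all keys; the output for a query is the
    attention-weighted sum of the value vectors. In a GSR-style model,
    queries could correspond to verb/role queries and keys/values to
    image features (illustrative analogy, not the paper's exact design).
    """
    d = len(keys[0])  # key dimension, used for the 1/sqrt(d) scaling
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs
```

A query closely aligned with one key receives nearly all attention weight for that key's value, which is the mechanism that lets a decoder query pick out the image regions relevant to a given semantic role.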
Related papers
- SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning [2.2184293265652895]
We present a transformer-based network architecture for remote sensing image captioning (RSIC). We evaluate our proposed models on two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption.
arXiv Detail & Related papers (2025-07-17T07:11:01Z) - A Generative Approach for Wikipedia-Scale Visual Entity Recognition [56.55633052479446]
We address the task of mapping a given query image to one of the 6 million existing entities in Wikipedia.
We introduce a novel Generative Entity Recognition framework, which learns to auto-regressively decode a semantic and discriminative "code" identifying the target entity.
arXiv Detail & Related papers (2024-03-04T13:47:30Z) - In-Domain GAN Inversion for Faithful Reconstruction and Editability [132.68255553099834]
We propose in-domain GAN inversion, which consists of a domain-guided encoder and a domain-regularized optimizer to keep the inverted code in the native latent space of the pre-trained GAN model.
We make comprehensive analyses on the effects of the encoder structure, the starting inversion point, as well as the inversion parameter space, and observe the trade-off between the reconstruction quality and the editing property.
arXiv Detail & Related papers (2023-09-25T08:42:06Z) - Recursive Generalization Transformer for Image Super-Resolution [108.67898547357127]
We propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images.
We combine the RG-SA with local self-attention to enhance the exploitation of the global context.
Our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively.
arXiv Detail & Related papers (2023-03-11T10:44:44Z) - Iterative collaborative routing among equivariant capsules for transformation-robust capsule networks [6.445605125467574]
We propose a capsule network model that is equivariant and compositionality-aware.
The awareness of compositionality comes from the use of our proposed novel, iterative, graph-based routing algorithm.
Experiments on transformed image classification on FashionMNIST, CIFAR-10, and CIFAR-100 show that our model that uses ICR outperforms convolutional and capsule baselines to achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-10-20T08:47:18Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit that adapts to the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Transformer Scale Gate for Semantic Segmentation [53.27673119360868]
Transformer Scale Gate (TSG) exploits cues in the self- and cross-attention of Vision Transformers for scale selection.
Our experiments on the Pascal Context and ADE20K datasets demonstrate that our feature selection strategy achieves consistent gains.
arXiv Detail & Related papers (2022-05-14T13:11:39Z) - ReSTR: Convolution-free Referring Image Segmentation Using Transformers [80.9672131755143]
We present the first convolution-free model for referring image segmentation using transformers, dubbed ReSTR.
Since it extracts features of both modalities through transformer encoders, ReSTR can capture long-range dependencies between entities within each modality.
Also, ReSTR fuses features of the two modalities by a self-attention encoder, which enables flexible and adaptive interactions between the two modalities in the fusion process.
arXiv Detail & Related papers (2022-03-31T02:55:39Z) - Cross-view Geo-localization with Evolving Transformer [7.5800316275498645]
Cross-view geo-localization is challenging due to drastic appearance and geometry differences across views.
We devise a novel geo-localization Transformer (EgoTR) that utilizes the properties of self-attention in Transformer to model global dependencies.
Our EgoTR performs favorably against state-of-the-art methods on standard, fine-grained and cross-dataset cross-view geo-localization tasks.
arXiv Detail & Related papers (2021-07-02T05:33:14Z) - Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore a novel framework for semantic image segmentation: encoder-decoder based Fully Transformer Networks (FTN).
We propose a Pyramid Group Transformer (PGT) as the encoder to progressively learn hierarchical features while reducing the computational complexity of the standard vision transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z) - Co-Scale Conv-Attentional Image Transformers [22.834316796018705]
Co-scale conv-attentional image Transformers (CoaT) are a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms.
On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and image/vision Transformers.
arXiv Detail & Related papers (2021-04-13T17:58:29Z) - Toward a Controllable Disentanglement Network [22.968760397814993]
This paper addresses two crucial problems of learning disentangled image representations, namely controlling the degree of disentanglement during image editing, and balancing the disentanglement strength and the reconstruction quality.
By exploring the real-valued space of the soft target representation, we are able to synthesize novel images with the designated properties.
arXiv Detail & Related papers (2020-01-22T16:54:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above (including all listed details) and is not responsible for any consequences of its use.