Rethinking the Two-Stage Framework for Grounded Situation Recognition
- URL: http://arxiv.org/abs/2112.05375v1
- Date: Fri, 10 Dec 2021 08:10:56 GMT
- Title: Rethinking the Two-Stage Framework for Grounded Situation Recognition
- Authors: Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Tat-Seng Chua
- Abstract summary: Grounded Situation Recognition (GSR) is an essential step towards "human-like" event understanding.
Existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage.
We propose a novel SituFormer for GSR, which consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM).
- Score: 61.93345308377144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounded Situation Recognition (GSR), i.e., recognizing the salient activity
(or verb) category in an image (e.g., buying) and detecting all corresponding
semantic roles (e.g., agent and goods), is an essential step towards
"human-like" event understanding. Since each verb is associated with a specific
set of semantic roles, all existing GSR methods resort to a two-stage
framework: predicting the verb in the first stage and detecting the semantic
roles in the second stage. However, there are obvious drawbacks in both stages:
1) The widely-used cross-entropy (XE) loss for object recognition is
insufficient in verb classification due to the large intra-class variation and
high inter-class similarity among daily activities. 2) All semantic roles are
detected in an autoregressive manner, which fails to model the complex semantic
relations between different roles. To this end, we propose a novel SituFormer
for GSR which consists of a Coarse-to-Fine Verb Model (CFVM) and a
Transformer-based Noun Model (TNM). CFVM is a two-step verb prediction model: a
coarse-grained model trained with XE loss first proposes a set of verb
candidates, and then a fine-grained model trained with triplet loss re-ranks
these candidates with enhanced verb features (not only separable but also
discriminative). TNM is a transformer-based semantic role detection model,
which detects all roles in parallel. Owing to the global relation modeling
ability and flexibility of the transformer decoder, TNM can fully explore the
statistical dependencies among the roles. Extensive experiments on the challenging
SWiG benchmark show that SituFormer achieves new state-of-the-art performance
with significant gains under various metrics. Code is available at
https://github.com/kellyiss/SituFormer.
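To make the two components concrete, the following is a minimal PyTorch-style sketch of the pipeline the abstract describes, not the authors' implementation (see the repository above for that). The feature dimensions, verb/role/noun counts, top-k candidate size, and the per-verb anchor embeddings used for triplet-style re-ranking are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineVerbModel(nn.Module):
    """CFVM sketch: a coarse head (trained with cross-entropy) proposes verb
    candidates, then a fine-grained embedding (trained with a triplet loss)
    re-ranks them. All sizes here are illustrative assumptions."""
    def __init__(self, feat_dim=512, num_verbs=504, top_k=10):
        super().__init__()
        self.coarse_head = nn.Linear(feat_dim, num_verbs)     # XE-trained classifier
        self.fine_proj = nn.Linear(feat_dim, feat_dim)        # triplet-trained embedding
        self.verb_anchor = nn.Embedding(num_verbs, feat_dim)  # hypothetical per-verb anchors
        self.top_k = top_k

    def forward(self, img_feat):                              # img_feat: (B, D)
        cand = self.coarse_head(img_feat).topk(self.top_k, dim=-1).indices  # (B, K) candidates
        q = F.normalize(self.fine_proj(img_feat), dim=-1)                   # (B, D)
        a = F.normalize(self.verb_anchor(cand), dim=-1)                     # (B, K, D)
        fine = torch.einsum("bd,bkd->bk", q, a)               # cosine re-ranking scores
        return cand.gather(1, fine.argmax(-1, keepdim=True)).squeeze(1)     # re-ranked verb (B,)

class TransformerNounModel(nn.Module):
    """TNM sketch: one query per semantic role of the predicted verb, decoded
    in parallel (non-autoregressively) by a standard transformer decoder."""
    def __init__(self, d_model=512, num_roles=190, num_nouns=10000):
        super().__init__()
        self.role_query = nn.Embedding(num_roles, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.noun_head = nn.Linear(d_model, num_nouns)        # noun class per role
        self.box_head = nn.Linear(d_model, 4)                 # grounding box per role

    def forward(self, img_tokens, role_ids):
        # img_tokens: (B, N, D) flattened image features; role_ids: (B, R) roles of the verb
        h = self.decoder(self.role_query(role_ids), img_tokens)  # roles attend to each other
        return self.noun_head(h), self.box_head(h).sigmoid()     # nouns + normalized boxes
```

In training, the fine-grained branch would be driven by a triplet loss (e.g., torch.nn.TripletMarginLoss) that pulls an image embedding toward its ground-truth verb and away from confusable ones, yielding features that are "not only separable but also discriminative"; on the noun side, self-attention among the role queries is what lets TNM model dependencies between roles that autoregressive decoding cannot revisit.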
Related papers
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-step working strategy: distributing language features, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
- RAGFormer: Learning Semantic Attributes and Topological Structure for Fraud Detection [8.050935113945428]
We present a novel framework called Relation-Aware GNN with transFormer (RAGFormer).
RAGFormer embeds both semantic and topological features into a target node.
The simple yet effective network consists of a semantic encoder, a topology encoder, and an attention fusion module.
arXiv Detail & Related papers (2024-02-27T12:53:15Z)
- GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement [73.73599110214828]
Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like" event understanding.
Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework.
We propose a novel two-stage framework that exploits the bidirectional relations between verbs and semantic roles.
arXiv Detail & Related papers (2022-08-18T17:13:59Z)
- ReSTR: Convolution-free Referring Image Segmentation Using Transformers [80.9672131755143]
We present the first convolution-free model for referring image segmentation using transformers, dubbed ReSTR.
Since it extracts features of both modalities through transformer encoders, ReSTR can capture long-range dependencies between entities within each modality.
Also, ReSTR fuses features of the two modalities by a self-attention encoder, which enables flexible and adaptive interactions between the two modalities in the fusion process.
arXiv Detail & Related papers (2022-03-31T02:55:39Z)
- Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing [71.19528222206088]
We propose a novel Decoupled Multi-task Learning with Cyclical Self-Regulation (DML-CSR) approach for face parsing.
Specifically, DML-CSR designs a multi-task model which comprises face parsing, binary edge, and category edge detection.
Our method achieves the new state-of-the-art performance on the Helen, CelebA-HQ, and LapaMask datasets.
arXiv Detail & Related papers (2022-03-28T02:12:30Z)
- Few Shot Activity Recognition Using Variational Inference [9.371378627575883]
We propose a novel variational-inference-based architectural framework (HF-AR) for few-shot activity recognition.
Our framework leverages a volume-preserving Householder Flow to learn a flexible posterior distribution over the novel classes; a minimal sketch of one Householder step appears after this list.
This yields better performance than state-of-the-art few-shot approaches for human activity recognition.
arXiv Detail & Related papers (2021-08-20T03:57:58Z)
- Fork or Fail: Cycle-Consistent Training with Many-to-One Mappings [67.11712279612583]
Cycle-consistent training is widely used for learning a forward and inverse mapping between two domains of interest.
We develop a conditional variational autoencoder (CVAE) approach that can be viewed as converting surjective mappings to implicit bijections.
Our pipeline can capture such many-to-one mappings during cycle training while promoting graph-to-text diversity.
arXiv Detail & Related papers (2020-12-14T10:59:59Z)
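As an aside on the Householder Flow named in the few-shot activity recognition entry above: a Householder reflection is the standard volume-preserving building block such flows stack on top of a Gaussian posterior. The sketch below is generic, not the HF-AR authors' code, and the batched shapes are assumptions.

```python
import torch

def householder_step(z, v):
    """One Householder flow step: z' = (I - 2 v v^T / ||v||^2) z.
    A reflection has |det J| = 1, so the step is volume-preserving and adds
    no log-determinant term to the variational objective.
    z: (B, D) posterior samples; v: (D,) learnable reflection direction."""
    v = v / v.norm()                             # unit normal of the reflection hyperplane
    return z - 2.0 * (z @ v).unsqueeze(-1) * v   # reflect z across that hyperplane
```

Stacking several such steps, each with its own learnable v, lets a diagonal-Gaussian posterior approximate a full-covariance one at no Jacobian cost, which is the flexibility the entry refers to.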