GSRFormer: Grounded Situation Recognition Transformer with Alternate
Semantic Attention Refinement
- URL: http://arxiv.org/abs/2208.08965v1
- Date: Thu, 18 Aug 2022 17:13:59 GMT
- Title: GSRFormer: Grounded Situation Recognition Transformer with Alternate
Semantic Attention Refinement
- Authors: Zhi-Qi Cheng, Qi Dai, Siyao Li, Teruko Mitamura, Alexander Hauptmann
- Abstract summary: Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like" event understanding.
Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework.
We propose a novel two-stage framework that focuses on utilizing the bidirectional relations between verbs and roles.
- Score: 73.73599110214828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounded Situation Recognition (GSR) aims to generate structured semantic
summaries of images for "human-like" event understanding. Specifically, the
GSR task not only detects the salient activity verb (e.g., buying), but also
predicts all corresponding semantic roles (e.g., agent and goods). Inspired by
object detection and image captioning tasks, existing methods typically employ
a two-stage framework: 1) detect the activity verb, and then 2) predict
semantic roles based on the detected verb. However, this rigid ordering is a
major obstacle to semantic understanding. First, pre-detecting the verb alone,
without its semantic roles, inevitably fails to distinguish many similar daily
activities (e.g., offering vs. giving, buying vs. selling).
Second, predicting semantic roles in a closed auto-regressive manner can
hardly exploit the semantic relations between the verb and its roles. To this
end, in this paper we propose a novel two-stage framework that focuses on
utilizing the bidirectional relations between verbs and roles. In the first
stage, instead of
pre-detecting the verb, we postpone the detection step and assume a pseudo
verb label, from which an intermediate representation for each corresponding
semantic role is learned from the image. In the second stage, we exploit
transformer layers
to unearth the latent semantic relations between verbs and semantic roles.
With the help of a set of support images, an alternate learning scheme is
designed to optimize both predictions: the verb is updated using the nouns of
the current image, and the nouns are updated using verbs from the support
images.
Extensive experiments on the challenging SWiG benchmark show that our
framework outperforms other state-of-the-art methods under various metrics.
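To make the alternate learning scheme concrete, below is a minimal PyTorch-style sketch, not the authors' released code: the module and tensor names, the embedding size, the number of refinement steps, and the SWiG-scale vocabulary sizes are all illustrative assumptions. Stage 1 is assumed to have produced role embeddings from the image without committing to a verb; the loop then alternates two cross-attention updates, refining the verb embedding from the image's role embeddings and the role embeddings from verb embeddings of the support images.

```python
import torch
import torch.nn as nn

class AlternateSemanticRefinement(nn.Module):
    """Illustrative sketch of the alternate refinement stage (hypothetical
    names and sizes; the actual GSRFormer architecture differs in detail)."""

    def __init__(self, dim=512, heads=8, steps=3,
                 num_verbs=504, num_nouns=11538):  # SWiG-scale vocabularies (assumed)
        super().__init__()
        self.steps = steps
        # One cross-attention block per update direction of the alternate scheme.
        self.verb_from_roles = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.roles_from_support = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.verb_head = nn.Linear(dim, num_verbs)
        self.noun_head = nn.Linear(dim, num_nouns)

    def forward(self, role_emb, verb_emb, support_verb_emb):
        # role_emb:         (B, R, D) role embeddings from stage 1
        # verb_emb:         (B, 1, D) pseudo-label verb embedding
        # support_verb_emb: (B, S, D) verb embeddings of the support images
        for _ in range(self.steps):
            # Update the verb using the roles/nouns of the current image ...
            verb_emb, _ = self.verb_from_roles(verb_emb, role_emb, role_emb)
            # ... then update the roles using verbs from the support images.
            role_emb, _ = self.roles_from_support(role_emb, support_verb_emb,
                                                  support_verb_emb)
        return self.verb_head(verb_emb.squeeze(1)), self.noun_head(role_emb)

# Example shapes: a batch of 2 images, 6 roles each, 4 support images.
model = AlternateSemanticRefinement()
verb_logits, noun_logits = model(torch.randn(2, 6, 512),
                                 torch.randn(2, 1, 512),
                                 torch.randn(2, 4, 512))
print(verb_logits.shape, noun_logits.shape)  # (2, 504) and (2, 6, 11538)
```

In this reading, each pass tightens the coupling the abstract describes: the verb hypothesis is conditioned on the roles actually present in the image, while the role representations are regularized by verbs observed in the support set.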
Related papers
- Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer [15.21084337999065]
Grounded Situation Recognition (GSR) requires the model to detect all semantic roles that participate in the action.
This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition.
We introduce a new approach for zero-shot GSR via a Language EXplainer (LEX).
arXiv Detail & Related papers (2024-04-24T10:17:13Z)
- Towards Image Semantics and Syntax Sequence Learning [8.033697392628424]
We introduce the concept of "image grammar", consisting of "image semantics" and "image syntax".
We propose a weakly supervised two-stage approach to learn the image grammar relative to a class of visual objects/scenes.
Our framework is trained to reason over patch semantics and detect faulty syntax.
arXiv Detail & Related papers (2024-01-31T00:16:02Z)
- Do Trajectories Encode Verb Meaning? [22.409307683247967]
Grounded language models learn to connect concrete categories like nouns and adjectives to the world via images and videos.
In this paper, we investigate the extent to which trajectories (i.e., the position and rotation of objects over time) naturally encode verb semantics.
We find that trajectories correlate as-is with some verbs (e.g., fall), and that additional abstraction via self-supervised pretraining can further capture nuanced differences in verb meaning.
arXiv Detail & Related papers (2022-06-23T19:57:16Z)
- Comprehending and Ordering Semantics for Image Captioning [124.48670699658649]
We propose a new Transformer-style architecture, namely Comprehending and Ordering Semantics Networks (COS-Net).
COS-Net unifies enriched semantic comprehending and learnable semantic ordering processes into a single architecture.
arXiv Detail & Related papers (2022-06-14T15:51:14Z)
- Graph Adaptive Semantic Transfer for Cross-domain Sentiment Classification [68.06496970320595]
Cross-domain sentiment classification (CDSC) aims to use the transferable semantics learned from the source domain to predict the sentiment of reviews in the unlabeled target domain.
We present the Graph Adaptive Semantic Transfer (GAST) model, an adaptive syntactic graph embedding method that learns domain-invariant semantics from both word sequences and syntactic graphs.
arXiv Detail & Related papers (2022-05-18T07:47:01Z)
- Rethinking the Two-Stage Framework for Grounded Situation Recognition [61.93345308377144]
Grounded Situation Recognition is an essential step towards "human-like" event understanding.
Existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage.
We propose a novel SituFormer for GSR, which consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM).
arXiv Detail & Related papers (2021-12-10T08:10:56Z)
- Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval [48.20798265640068]
We introduce additional phrase-level supervision for the better identification of mismatched units in the text.
We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels.
For training, we propose multi-scale matching losses from both global and local perspectives.
arXiv Detail & Related papers (2021-09-12T14:21:15Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-art results on all these settings, demonstrating its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)