Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer
- URL: http://arxiv.org/abs/2404.15785v1
- Date: Wed, 24 Apr 2024 10:17:13 GMT
- Title: Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer
- Authors: Jiaming Lei, Lin Li, Chunping Wang, Jun Xiao, Long Chen
- Abstract summary: grounded situation recognition (GSR) requires the model to detect all semantic roles that participate in the action.
This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition.
We introduce a new approach for zero-shot GSR via Language EXplainer (LEX).
- Score: 15.21084337999065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Benefiting from strong generalization ability, pre-trained vision-language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify the salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, to accurately localize roles with fixed verb-centric template input, and to achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehension capabilities through three explainers: 1) a verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) a grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enabling precise semantic role localization; and 3) a noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interpretability in zero-shot GSR.
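The description-based prompting pattern behind the verb and noun explainers can be sketched in a few lines of code. The snippet below is a minimal illustration under assumptions, not the authors' LEX implementation: it uses an off-the-shelf CLIP model from the Hugging Face transformers library, the `verb_descriptions` and `noun_descriptions` dictionaries are hypothetical stand-ins for what the explainers would generate (in the paper these come from a language model), and the grounding-explainer / role-localization step is omitted.

```python
# Sketch of explainer-style zero-shot classification with CLIP.
# `verb_descriptions` / `noun_descriptions` are hypothetical stand-ins for
# explainer outputs; this is NOT the authors' implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Assumed verb-explainer output: general descriptions per verb class.
verb_descriptions = {
    "jumping": ["a person or animal pushing off the ground into the air"],
    "riding":  ["a person sitting on and controlling a vehicle or animal"],
}

def score_classes(image, class_to_texts):
    """Return the class whose averaged description scores best match the image."""
    texts, owners = [], []
    for cls, descs in class_to_texts.items():
        for d in descs:
            texts.append(d)
            owners.append(cls)
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # one similarity per description
    # Average description scores per class, then pick the best class.
    scores, counts = {}, {}
    for owner, logit in zip(owners, logits.tolist()):
        scores[owner] = scores.get(owner, 0.0) + logit
        counts[owner] = counts.get(owner, 0) + 1
    return max(scores, key=lambda c: scores[c] / counts[c])

image = Image.open("example.jpg")  # hypothetical input image
verb = score_classes(image, verb_descriptions)

# Assumed noun-explainer output: scene-specific descriptions conditioned on the
# predicted verb, for candidate nouns of one semantic role.
noun_descriptions = {
    "horse":   [f"a horse involved in a {verb} scene"],
    "bicycle": [f"a bicycle involved in a {verb} scene"],
}
noun = score_classes(image, noun_descriptions)
print(verb, noun)
```

In the full method described by the abstract, each semantic role of the predicted verb would first be localized by passing a grounding-explainer-rephrased prompt to a grounding model, and the noun scoring would then run on the cropped role region rather than the whole image.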
Related papers
- Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition [3.8623569699070357]
This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment relations among verbs.
Our findings unveil that the models can tackle the lexical entailment recognition task with moderately good performance.
arXiv Detail & Related papers (2024-06-21T06:30:16Z)
- Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework that combines self-supervised representation learning with language label information for the pre-training task.
This framework, termed Label Aware Speech Representation (LASR) learning, uses a triplet-based objective function to incorporate language labels along with the self-supervised loss function.
arXiv Detail & Related papers (2023-06-07T12:14:16Z)
- GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions [92.96783800362886]
Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks.
We introduce GRILL, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances.
arXiv Detail & Related papers (2023-05-24T03:33:21Z)
- Verbs in Action: Improving verb understanding in video-language models [128.87443209118726]
State-of-the-art video-language models based on CLIP have been shown to have limited verb understanding.
We improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive framework.
arXiv Detail & Related papers (2023-04-13T17:57:01Z)
- GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement [73.73599110214828]
Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like" event understanding.
Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework.
We propose a novel two-stage framework that focuses on utilizing the bidirectional relations between verbs and roles.
arXiv Detail & Related papers (2022-08-18T17:13:59Z)
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? [112.72413411257662]
Large language models (LMs) are able to in-context learn by conditioning on a few input-label pairs (demonstrations) and making predictions for new inputs.
We show that ground truth demonstrations are in fact not required -- randomly replacing labels in the demonstrations barely hurts performance.
We find that other aspects of the demonstrations are the key drivers of end task performance.
arXiv Detail & Related papers (2022-02-25T17:25:19Z)
- Rethinking the Two-Stage Framework for Grounded Situation Recognition [61.93345308377144]
Grounded Situation Recognition is an essential step towards "human-like" event understanding.
Existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage.
We propose a novel SituFormer for GSR, which consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM).
arXiv Detail & Related papers (2021-12-10T08:10:56Z)