Co-Occurrence Matters: Learning Action Relation for Temporal Action
Localization
- URL: http://arxiv.org/abs/2303.08463v1
- Date: Wed, 15 Mar 2023 09:07:04 GMT
- Title: Co-Occurrence Matters: Learning Action Relation for Temporal Action
Localization
- Authors: Congqi Cao, Yizhe Wang, Yue Lu, Xin Zhang and Yanning Zhang
- Abstract summary: We propose a novel Co-Occurrence Relation Module (CORM) that explicitly models the co-occurrence relationship between actions.
Besides the visual information, it further utilizes the semantic embeddings of class labels to model the co-occurrence relationship.
Our method achieves high multi-label relationship modeling capacity.
- Score: 41.44022912961265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action localization (TAL) is a prevailing task due to its great
application potential. Existing works in this field mainly suffer from two
weaknesses: (1) They often neglect the multi-label case and only focus on
temporal modeling. (2) They ignore the semantic information in class labels and
only use the visual information. To solve these problems, we propose a novel
Co-Occurrence Relation Module (CORM) that explicitly models the co-occurrence
relationship between actions. Besides the visual information, it further
utilizes the semantic embeddings of class labels to model the co-occurrence
relationship. The CORM works in a plug-and-play manner and can be easily
incorporated with the existing sequence models. By considering both visual and
semantic co-occurrence, our method achieves high multi-label relationship
modeling capacity. Meanwhile, existing datasets in TAL always focus on
low-semantic atomic actions. Thus we construct a challenging multi-label
dataset UCF-Crime-TAL that focuses on high-semantic actions by annotating the
UCF-Crime dataset at frame level and considering the semantic overlap of
different events. Extensive experiments on two commonly used TAL datasets,
i.e., MultiTHUMOS and TSU, and our newly proposed UCF-Crime-TAL
demonstrate the effectiveness of the proposed CORM, which achieves
state-of-the-art performance on these datasets.
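The abstract does not give implementation details, but the core idea of combining visual and semantic co-occurrence can be illustrated with a minimal numpy sketch. All shapes, the fusion weight, and the residual refinement step below are assumptions for illustration, not the paper's actual CORM design:

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, C = 8, 16, 5                   # frames, feature dim, action classes (hypothetical)
X = rng.normal(size=(T, D))          # per-frame visual features (hypothetical)
W = rng.normal(size=(D, C)) * 0.1    # classifier weights (hypothetical)
E = rng.normal(size=(C, D))          # semantic embeddings of class labels (hypothetical)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# initial per-frame class logits from visual information alone
logits = X @ W                               # (T, C)

# semantic co-occurrence: similarity between label embeddings
sem_rel = softmax(E @ E.T / np.sqrt(D))      # (C, C)

# visual co-occurrence: correlation of class activations across time
act = softmax(logits)                        # (T, C)
vis_rel = softmax(act.T @ act / T)           # (C, C)

# fuse both relation matrices and propagate them to refine the logits,
# so classes that tend to co-occur reinforce each other's scores
rel = 0.5 * (sem_rel + vis_rel)
refined = logits + logits @ rel.T            # (T, C), residual refinement

print(refined.shape)
```

The plug-and-play claim corresponds to the fact that such a module only transforms class scores, so it could follow any sequence backbone that produces per-frame logits.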
Related papers
- Concrete Subspace Learning based Interference Elimination for Multi-task
Model Fusion [86.6191592951269]
Merging models that were fine-tuned from a common, extensively pretrained large model but specialized for different tasks has been demonstrated as a cheap and scalable strategy to construct a multi-task model that performs well across diverse tasks.
We propose the CONtinuous relaxation of discrete (Concrete) subspace learning method to identify a common low-dimensional subspace and utilize its shared information to tackle the interference problem without sacrificing performance.
arXiv Detail & Related papers (2023-12-11T07:24:54Z)
- RelVAE: Generative Pretraining for few-shot Visual Relationship Detection [2.2230760534775915]
We present the first pretraining method for few-shot predicate classification that does not require any annotated relations.
We construct few-shot training splits and show quantitative experiments on VG200 and VRD datasets.
arXiv Detail & Related papers (2023-11-27T19:08:08Z)
- Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning [35.27714529976667]
Temporal relation classification is a pair-wise task for identifying the relation of a temporal link (Tlink) between two mentions.
This paper presents an event-centric model that manages dynamic event representations across multiple Tlink categories.
Our proposal outperforms state-of-the-art models and two transfer-learning baselines on both the English and Japanese data.
arXiv Detail & Related papers (2023-10-31T07:41:24Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction [78.61546292830081]
We construct a large-scale human-annotated ERE dataset MAVEN-ERE with improved annotation schemes.
It contains 103,193 event coreference chains, 1,216,217 temporal relations, 57,992 causal relations, and 15,841 subevent relations.
Experiments show that ERE on MAVEN-ERE is quite challenging, and considering relation interactions with joint learning can improve performances.
arXiv Detail & Related papers (2022-11-14T13:34:49Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only on a source dataset and unavailable on the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.