Inferring Temporal Compositions of Actions Using Probabilistic Automata
- URL: http://arxiv.org/abs/2004.13217v1
- Date: Tue, 28 Apr 2020 00:15:26 GMT
- Title: Inferring Temporal Compositions of Actions Using Probabilistic Automata
- Authors: Rodrigo Santa Cruz, Anoop Cherian, Basura Fernando, Dylan Campbell,
and Stephen Gould
- Abstract summary: We propose to express temporal compositions of actions as semantic regular expressions and derive an inference framework using probabilistic automata.
Our approach is different from existing works that either predict long-range complex activities as unordered sets of atomic actions, or retrieve videos using natural language sentences.
- Score: 61.09176771931052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a framework to recognize temporal compositions of atomic
actions in videos. Specifically, we propose to express temporal compositions of
actions as semantic regular expressions and derive an inference framework using
probabilistic automata to recognize complex actions as satisfying these
expressions on the input video features. Our approach is different from
existing works that either predict long-range complex activities as unordered
sets of atomic actions, or retrieve videos using natural language sentences.
Instead, the proposed approach allows recognizing complex fine-grained
activities using only pretrained action classifiers, without requiring any
additional data, annotations or neural network training. To evaluate the
potential of our approach, we provide experiments on synthetic datasets and
challenging real action recognition datasets, such as MultiTHUMOS and Charades.
We conclude that the proposed approach can extend state-of-the-art primitive
action classifiers to vastly more complex activities without large performance
degradation.
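To make the inference idea concrete, here is a minimal, illustrative Python sketch (not the authors' code) of how a probabilistic automaton can score per-frame action probabilities from pretrained classifiers against the expression "action A, then action B"; the three-state automaton, function names, and random inputs are assumptions for illustration only.

```python
import numpy as np

def forward_score(frame_probs, idx_a, idx_b):
    """Soft forward recursion over a 3-state automaton:
    state 0 = waiting for A, state 1 = A seen / waiting for B,
    state 2 = accepted (absorbing). frame_probs: (T, num_actions)
    per-frame softmax scores; returns probability of acceptance."""
    alpha = np.array([1.0, 0.0, 0.0])  # start in state 0
    for p in frame_probs:
        p_a, p_b = p[idx_a], p[idx_b]
        alpha = np.array([
            alpha[0] * (1.0 - p_a),                    # still waiting for A
            alpha[0] * p_a + alpha[1] * (1.0 - p_b),   # A seen, waiting for B
            alpha[1] * p_b + alpha[2],                 # B seen after A: accept
        ])
    return alpha[2]

# Hypothetical usage with random classifier outputs:
T, K = 50, 10
probs = np.random.dirichlet(np.ones(K), size=T)  # (T, K) frame scores
print(forward_score(probs, idx_a=3, idx_b=7))
```

Note that the three transition terms conserve total probability mass at every frame, so the returned value is a proper probability of the video satisfying the expression.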
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method that can effectively leverage the rich knowledge of vision-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z) - Activity Grammars for Temporal Action Segmentation [71.03141719666972]
Temporal action segmentation aims to translate an untrimmed activity video into a sequence of action segments.
This paper introduces an effective activity grammar to guide neural predictions for temporal action segmentation.
Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability.
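As a toy illustration of grammar-guided decoding (the paper's activity grammar and parser are more general than this sketch), one can constrain frame-wise Viterbi decoding to action transitions licensed by a hand-written grammar; the action names and transition table below are invented for the example.

```python
import numpy as np

ACTIONS = ["take", "cut", "put"]
ALLOWED = {("take", "cut"), ("cut", "put"),                     # enforced order
           ("take", "take"), ("cut", "cut"), ("put", "put")}    # self-loops

def constrained_viterbi(log_probs):
    """log_probs: (T, num_actions) frame-wise log scores from a
    segmentation network; returns the best grammar-consistent labeling."""
    T, K = log_probs.shape
    dp = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    dp[0] = log_probs[0]
    for t in range(1, T):
        for j in range(K):
            for i in range(K):
                if (ACTIONS[i], ACTIONS[j]) in ALLOWED and \
                        dp[t - 1, i] + log_probs[t, j] > dp[t, j]:
                    dp[t, j] = dp[t - 1, i] + log_probs[t, j]
                    back[t, j] = i
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [ACTIONS[k] for k in reversed(path)]

print(constrained_viterbi(np.log(np.random.dirichlet(np.ones(3), size=6) + 1e-9)))
```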
arXiv Detail & Related papers (2023-12-07T12:45:33Z) - A Grammatical Compositional Model for Video Action Detection [24.546886938243393]
We present a novel Grammatical Compositional Model (GCM) for action detection based on typical And-Or graphs.
Our model exploits the intrinsic structures and latent relationships of actions in a hierarchical manner, harnessing both the compositionality of grammar models and the rich feature-expression capability of DNNs.
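A minimal sketch of And-Or graph scoring may help make the compositional idea concrete; the graph, action names, and scores below are hypothetical, and the paper's GCM couples such structures with DNN features rather than fixed leaf scores.

```python
# Toy And-Or graph scoring: And-nodes compose sub-action scores
# (sum of log-probabilities), Or-nodes pick the best alternative
# decomposition (max). Leaves look up an atomic action's score.

def score(node, leaf_scores):
    kind, children = node
    if kind == "leaf":
        return leaf_scores[children]  # children holds the action name
    vals = [score(c, leaf_scores) for c in children]
    return sum(vals) if kind == "and" else max(vals)

leaf_scores = {"reach": -0.2, "grasp": -0.5, "push": -1.2, "pull": -0.7}
graph = ("or", [("and", [("leaf", "reach"), ("leaf", "grasp")]),
                ("and", [("leaf", "reach"), ("leaf", "push")])])
print(score(graph, leaf_scores))  # best decomposition: reach + grasp = -0.7
```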
arXiv Detail & Related papers (2023-10-04T15:24:00Z) - Knowledge Prompting for Few-shot Action Recognition [20.973999078271483]
We propose a simple yet effective method, called knowledge prompting, to prompt a powerful vision-language model for few-shot classification.
We first collect large-scale language descriptions of actions, defined as text proposals, to build an action knowledge base.
We feed these text proposals into the pre-trained vision-language model along with video frames to generate matching scores of the proposals to each frame.
Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training overhead to 0.1% of that of existing methods.
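The proposal-to-frame matching step admits a short sketch. Assuming OpenAI's CLIP as the vision-language model (the paper's exact backbone may differ) and hypothetical frame paths, the matching scores can be computed as follows:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative text proposals and hypothetical frame paths:
proposals = ["a person opens a door", "a person pours water"]
text = clip.tokenize(proposals).to(device)
frames = [preprocess(Image.open(f"frame_{i}.jpg")) for i in range(8)]
images = torch.stack(frames).to(device)

with torch.no_grad():
    img_emb = model.encode_image(images)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = img_emb @ txt_emb.T  # (num_frames, num_proposals) matching scores
```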
arXiv Detail & Related papers (2022-11-22T06:05:17Z) - FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality
Assessment [93.09267863425492]
We argue that understanding both high-level semantics and internal temporal structures of actions in competitive sports videos is the key to making predictions accurate and interpretable.
We construct a new fine-grained dataset, called FineDiving, built on diverse diving events with detailed annotations of action procedures.
arXiv Detail & Related papers (2022-04-07T17:59:32Z) - Learning to Ask Conversational Questions by Optimizing Levenshtein
Distance [83.53855889592734]
We introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions.
RISE is able to pay attention to tokens that are related to conversational characteristics.
Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods.
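Since RISE optimizes the minimum Levenshtein distance, the quantity being minimized is worth pinning down; the standard dynamic program below computes it over token sequences (RISE itself learns which edit actions to apply rather than running this DP).

```python
def levenshtein(src, tgt):
    """Minimum number of token insertions, deletions, and
    substitutions needed to turn src into tgt."""
    m, n = len(src), len(tgt)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                           # delete all of src[:i]
    for j in range(n + 1):
        dp[0][j] = j                           # insert all of tgt[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # keep / substitute
    return dp[m][n]

print(levenshtein("what about him".split(), "what about his car".split()))  # 2
```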
arXiv Detail & Related papers (2021-06-30T08:44:19Z) - Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset developed on sport videos with manual annotations of sub-actions, and conduct a study of temporal action parsing on top of it.
Our study shows that a sport activity usually consists of multiple sub-actions and that the awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z) - A Dependency Syntactic Knowledge Augmented Interactive Architecture for
End-to-End Aspect-based Sentiment Analysis [73.74885246830611]
We propose a novel dependency syntactic knowledge augmented interactive architecture with multi-task learning for end-to-end ABSA.
This model is capable of fully exploiting the syntactic knowledge (dependency relations and types) by leveraging a well-designed Dependency Relation Embedded Graph Convolutional Network (DreGcn).
Extensive experimental results on three benchmark datasets demonstrate the effectiveness of our approach.
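As a rough illustration of the graph-convolution idea behind DreGcn (the actual model also embeds dependency types; this toy layer uses only the arc structure), consider a single GCN layer over a dependency-parse adjacency matrix:

```python
import torch
import torch.nn as nn

class DepGCNLayer(nn.Module):
    """One degree-normalized graph convolution over dependency arcs."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h: (num_tokens, dim) token features;
        # adj: (num_tokens, num_tokens) adjacency from the parse,
        # with self-loops added so each token keeps its own features.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        return torch.relu(self.linear(adj @ h) / deg)

tokens, dim = 5, 16
h = torch.randn(tokens, dim)
adj = torch.eye(tokens)            # self-loops
adj[0, 1] = adj[1, 0] = 1.0        # e.g., one dependency arc (hypothetical)
out = DepGCNLayer(dim)(h, adj)
```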
arXiv Detail & Related papers (2020-04-04T14:59:32Z) - Active Learning in Video Tracking [8.782204980889079]
We propose an adversarial approach to active learning in structured prediction domains that is tractable for matching.
We evaluate this approach algorithmically on an important structured prediction problem: object tracking in videos.
arXiv Detail & Related papers (2019-12-29T00:42:06Z)