Technical Report for Ego4D Long Term Action Anticipation Challenge 2023
- URL: http://arxiv.org/abs/2307.01467v1
- Date: Tue, 4 Jul 2023 04:12:49 GMT
- Title: Technical Report for Ego4D Long Term Action Anticipation Challenge 2023
- Authors: Tatsuya Ishibashi, Kosuke Ono, Noriyuki Kugo, Yuji Sato
- Abstract summary: We describe the technical details of our approach for the Ego4D Long-Term Action Anticipation Challenge 2023.
The aim of this task is to predict a sequence of future actions that will take place at an arbitrary time or later, given an input video.
Our method outperformed the baseline and was the second-place solution on the public leaderboard.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this report, we describe the technical details of our approach for the
Ego4D Long-Term Action Anticipation Challenge 2023. The aim of this task is to
predict a sequence of future actions that will take place at an arbitrary time
or later, given an input video. To accomplish this task, we introduce three
improvements to the baseline model, which consists of an encoder that generates
clip-level features from the video, an aggregator that integrates multiple
clip-level features, and a decoder that outputs Z future actions. The three
improvements are: 1) a model ensemble of SlowFast and SlowFast-CLIP; 2) label
smoothing to relax the order constraints on future actions; and 3) constraining
the predicted action class (verb, noun) using word co-occurrence. Our method
outperformed the baseline and was the second-place solution on the public
leaderboard.
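To make improvements 1) and 3) more concrete, the sketch below shows one plausible way to combine the two-model ensemble with a verb-noun co-occurrence constraint at inference time. This is not the authors' released code: the tensor shapes, the equal-weight averaging of the two heads, the 0/1 co-occurrence matrix counted from training annotations, and the class counts in the usage example are all assumptions made for illustration.

```python
# Illustrative sketch only (assumed shapes and names, not the authors' code).
import torch


def ensemble_probs(slowfast_logits: torch.Tensor,
                   clip_logits: torch.Tensor) -> torch.Tensor:
    """Average class probabilities of the SlowFast and SlowFast-CLIP heads.

    Both inputs are assumed to be (Z, num_classes) logits, one row per
    predicted future action slot.
    """
    return 0.5 * (slowfast_logits.softmax(-1) + clip_logits.softmax(-1))


def constrain_nouns(verb_probs: torch.Tensor,
                    noun_probs: torch.Tensor,
                    cooc: torch.Tensor) -> torch.Tensor:
    """Suppress nouns that never co-occur with the most likely verb.

    `cooc` is an assumed (num_verbs, num_nouns) 0/1 matrix counted from the
    training annotations.
    """
    verb_idx = verb_probs.argmax(dim=-1)          # (Z,)
    allowed = cooc[verb_idx]                      # (Z, num_nouns)
    masked = noun_probs * allowed
    # If the mask removes every noun for some slot, keep the original scores.
    empty = masked.sum(dim=-1, keepdim=True) == 0
    masked = torch.where(empty, noun_probs, masked)
    return masked / masked.sum(dim=-1, keepdim=True)


# Hypothetical usage with Z = 20 future actions, 115 verbs, and 478 nouns.
Z, num_verbs, num_nouns = 20, 115, 478
verb_probs = ensemble_probs(torch.randn(Z, num_verbs), torch.randn(Z, num_verbs))
noun_probs = ensemble_probs(torch.randn(Z, num_nouns), torch.randn(Z, num_nouns))
cooc = (torch.rand(num_verbs, num_nouns) > 0.8).float()  # stand-in for real counts
noun_probs = constrain_nouns(verb_probs, noun_probs, cooc)
future_verbs = verb_probs.argmax(dim=-1)
future_nouns = noun_probs.argmax(dim=-1)
```

In practice the co-occurrence matrix would be counted once from the training annotations; the fallback to the unconstrained noun distribution guards against verbs whose co-occurrence row is empty. The label-smoothing improvement is not sketched here, since the report does not spell out its exact form.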
Related papers
- Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization (arXiv, 2024-10-31)
We propose to locate the temporal boundaries of each action and predict the action class in untrimmed videos.
Faster-TAD simplifies the TAD pipeline and achieves remarkable performance.
- Harnessing Temporal Causality for Advanced Temporal Action Detection (arXiv, 2024-07-25)
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
- Palm: Predicting Actions through Language Models @ Ego4D Long-Term Action Anticipation Challenge 2023 (arXiv, 2023-06-28)
Palm is a solution to the Long-Term Action Anticipation task that utilizes vision-language and large language models.
It predicts future actions based on frame descriptions and action labels extracted from the input videos.
- STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training (arXiv, 2023-02-20)
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
- Exploiting Feature Diversity for Make-up Temporal Video Grounding (arXiv, 2022-08-12)
This report presents the 3rd-place winning solution for MTVG, a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022.
MTVG aims at localizing the temporal boundary of a step in an untrimmed video based on a textual description.
- Video + CLIP Baseline for Ego4D Long-term Action Anticipation (arXiv, 2022-07-01)
The Video + CLIP framework makes use of a large-scale pre-trained paired image-text model (CLIP) and a SlowFast video encoder.
We show that the features obtained from the two encoders are complementary to each other, thus outperforming the baseline on Ego4D for the task of long-term action anticipation.
- Context-aware Proposal Network for Temporal Action Detection (arXiv, 2022-06-18)
This report presents our first-place winning solution for the temporal action detection task in the CVPR 2022 ActivityNet Challenge.
The task aims to localize the temporal boundaries of action instances with specific classes in long untrimmed videos.
We argue that the generated proposals contain rich contextual information, which may benefit detection confidence prediction.
- Egocentric Action Recognition by Video Attention and Temporal Context (arXiv, 2020-07-03)
We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge.
In this challenge, action recognition is posed as the problem of simultaneously predicting a single 'verb' and 'noun' class label given an input trimmed video clip.
Our solution achieves strong performance on the challenge metrics without using object-specific reasoning or extra training data.
- Compositional Video Synthesis with Action Graphs (arXiv, 2020-06-27)
Videos of actions are complex signals containing rich compositional structure in space and time.
We propose to represent actions in a graph structure called an Action Graph and present the new "Action Graph To Video" synthesis task.
Our generative model for this task (AG2Vid) disentangles motion and appearance features and, by incorporating a scheduling mechanism for actions, facilitates timely and coordinated video generation.
This list is automatically generated from the titles and abstracts of the papers on this site.