Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D
Moment Queries Challenge
- URL: http://arxiv.org/abs/2211.09074v1
- Date: Wed, 16 Nov 2022 17:43:26 GMT
- Title: Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D
Moment Queries Challenge
- Authors: Fangzhou Mu, Sicheng Mo, Gillian Wang, Yin Li
- Abstract summary: Our submission builds on ActionFormer, the state-of-the-art backbone for temporal action localization, and a trio of strong video features from SlowFast, Omnivore, and EgoVLP.
Our solution is ranked 2nd on the public leaderboard with 21.76% average mAP on the test set, which is nearly three times higher than the official baseline.
- Score: 7.718326034763966
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report describes our submission to the Ego4D Moment Queries Challenge
2022. Our submission builds on ActionFormer, the state-of-the-art backbone for
temporal action localization, and a trio of strong video features from
SlowFast, Omnivore and EgoVLP. Our solution is ranked 2nd on the public
leaderboard with 21.76% average mAP on the test set, which is nearly three
times higher than the official baseline. Further, we obtain 42.54% Recall@1x at
tIoU=0.5 on the test set, outperforming the top-ranked solution by a
significant margin of 1.41 absolute percentage points. Our code is available at
https://github.com/happyharrycn/actionformer_release.
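The submission combines pre-extracted clip features from the three networks before feeding them to the ActionFormer backbone. As a rough illustration only (not the authors' released code, which lives at the repository above; tensor shapes, feature dimensions, and the linear-interpolation alignment are assumptions), a minimal PyTorch sketch of channel-wise feature fusion could look like this:

```python
# Minimal sketch of fusing pre-extracted SlowFast, Omnivore and EgoVLP clip
# features: resample each stream to a common temporal length, then concatenate
# along the channel axis before an ActionFormer-style localization head.
# Shapes and the interpolation strategy are illustrative assumptions.
import torch
import torch.nn.functional as F

def fuse_video_features(slowfast, omnivore, egovlp, num_frames=None):
    """Each input is a (T_i, C_i) tensor of clip-level features for one video."""
    feats = [slowfast, omnivore, egovlp]
    if num_frames is None:
        num_frames = max(f.shape[0] for f in feats)
    aligned = []
    for f in feats:
        # (T, C) -> (1, C, T) so we can resample along time, then back to (T', C)
        f = f.t().unsqueeze(0)
        f = F.interpolate(f, size=num_frames, mode="linear", align_corners=False)
        aligned.append(f.squeeze(0).t())
    # Concatenate along channels: (T', C_slowfast + C_omnivore + C_egovlp)
    return torch.cat(aligned, dim=1)

# Example with made-up feature dimensions
fused = fuse_video_features(torch.randn(928, 2304),   # SlowFast
                            torch.randn(928, 1536),   # Omnivore
                            torch.randn(928, 256))    # EgoVLP
print(fused.shape)  # torch.Size([928, 4096])
```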
Related papers
- NMS Threshold matters for Ego4D Moment Queries -- 2nd place solution to
the Ego4D Moment Queries Challenge 2023 [8.674624972031387]
This report describes our submission to the Ego4D Moment Queries Challenge 2023.
Our submission extends ActionFormer, a state-of-the-art method for temporal action localization.
Our solution is ranked 2nd on the public leaderboard with 26.62% average mAP and 45.69% Recall@1x at tIoU=0.5 on the test set, significantly outperforming the strong baseline from the 2023 challenge.
arXiv Detail & Related papers (2023-07-05T05:23:49Z)
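The entry above singles out the non-maximum suppression (NMS) threshold as a key post-processing hyperparameter for the dense moment proposals an ActionFormer-style detector produces. The sketch below shows plain greedy temporal NMS with a tunable tIoU threshold; it is illustrative only, and the actual submission may use a soft-NMS variant or different settings.

```python
# Minimal hard temporal NMS over 1D segments; the iou_threshold argument is the
# knob the entry's title refers to. Illustrative only, not the paper's code.
def temporal_iou(a, b):
    """tIoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring segment, drop heavily overlapping ones."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if temporal_iou(segments[best], segments[i]) < iou_threshold]
    return keep

# Example: three candidate moments, the first two heavily overlapping
segs = [(1.0, 5.0), (1.2, 5.3), (10.0, 12.0)]
scores = [0.9, 0.8, 0.7]
print(temporal_nms(segs, scores, iou_threshold=0.5))  # [0, 2]
```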
- GroundNLQ @ Ego4D Natural Language Queries Challenge 2023 [73.12670280220992]
To accurately ground in a video, an effective egocentric feature extractor and a powerful grounding model are required.
We leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations.
In addition, we introduce a novel grounding model GroundNLQ, which employs a multi-modal multi-scale grounding module.
arXiv Detail & Related papers (2023-06-27T07:27:52Z)
- Action Sensitivity Learning for the Ego4D Episodic Memory Challenge 2023 [41.10032280192564]
This report presents ReLER submission to two tracks in the Ego4D Episodic Memory Benchmark in CVPR 2023.
This solution inherits from our proposed Action Sensitivity Learning framework (ASL) to better capture discrepant information of frames.
arXiv Detail & Related papers (2023-06-15T14:50:17Z)
- The Runner-up Solution for YouTube-VIS Long Video Challenge 2022 [72.13080661144761]
We adopt the previously proposed online video instance segmentation method IDOL for this challenge.
We use pseudo labels to further help contrastive learning, so as to obtain more temporally consistent instance embedding.
The proposed method obtains 40.2 AP on the YouTube-VIS 2022 long video dataset and was ranked second in this challenge.
arXiv Detail & Related papers (2022-11-18T01:40:59Z)
- ReLER@ZJU Submission to the Ego4D Moment Queries Challenge 2022 [42.02602065259257]
We present the ReLER@ZJU submission to the Ego4D Moment Queries Challenge in ECCV 2022.
The goal is to retrieve and localize all instances of possible activities in egocentric videos.
The final submission achieved a Recall@1 score of 37.24 at tIoU=0.5 and an average mAP of 17.67, taking 3rd place on the leaderboard.
arXiv Detail & Related papers (2022-11-17T14:28:31Z)
- A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge [8.674624972031387]
This report describes our submission to the Ego4D Natural Language Queries (NLQ) Challenge.
Our solution inherits the point-based event representation from our prior work on temporal action localization, and develops a Transformer-based model for video grounding.
Without bells and whistles, our submission based on a single model achieves 12.64% Mean R@1 and is ranked 2nd on the public leaderboard.
arXiv Detail & Related papers (2022-11-16T06:33:37Z)
- Egocentric Video-Language Pretraining @ Ego4D Challenge 2022 [74.04740069230692]
We propose a video-language pretraining solution, EgoVLP, for four Ego4D challenge tasks.
Building on three key designs described in the paper, we develop a pretrained video-language model that transfers its egocentric video-text representation to several downstream video tasks.
arXiv Detail & Related papers (2022-07-04T12:47:16Z)
- Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022 [22.299810960572348]
We propose a video-language pretraining solution, EgoVLP, for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge.
Our best single model obtains strong performance on the challenge test set with 47.39% mAP and 61.44% nDCG.
arXiv Detail & Related papers (2022-07-04T11:32:48Z)
- ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022 [61.81899056005645]
Given a video clip and a text query, the goal of this challenge is to locate a temporal moment of the video clip where the answer to the query can be obtained.
We propose a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips.
The experimental results demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2022-07-01T12:48:35Z)
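The ReLER@ZJU-Alibaba entry above pairs a multi-scale cross-modal transformer with a video frame-level contrastive loss. A minimal, assumed formulation of such a loss is sketched below: frames inside the annotated moment are treated as positives for the query embedding and all other frames as negatives. This is an illustration of the general idea, not necessarily the paper's exact objective; the function name, tensor shapes, and temperature value are assumptions.

```python
# Sketch of a frame-level contrastive (InfoNCE-style) loss between a text query
# embedding and per-frame video embeddings. Assumed formulation for illustration.
import torch
import torch.nn.functional as F

def frame_level_contrastive_loss(query_emb, frame_embs, inside_mask, temperature=0.07):
    """
    query_emb:   (D,)   text query embedding
    frame_embs:  (T, D) per-frame video embeddings
    inside_mask: (T,)   bool, True for frames inside the annotated moment
    """
    q = F.normalize(query_emb, dim=0)
    v = F.normalize(frame_embs, dim=1)
    logits = v @ q / temperature                # (T,) similarity of each frame to the query
    log_prob = F.log_softmax(logits, dim=0)     # softmax over all frames of the clip
    # Average negative log-likelihood of the positive (in-moment) frames
    return -(log_prob[inside_mask]).mean()

# Example with random embeddings and a 3-frame positive span
loss = frame_level_contrastive_loss(
    torch.randn(256), torch.randn(30, 256),
    torch.tensor([False] * 10 + [True] * 3 + [False] * 17))
print(loss.item())
```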
- NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results [279.8098140331206]
The NTIRE 2022 challenge was to super-resolve an input image with a magnification factor of ×4 based on pairs of low- and corresponding high-resolution images.
The aim was to design a network for single image super-resolution that achieved improvement of efficiency measured according to several metrics.
arXiv Detail & Related papers (2022-05-11T17:58:54Z)
- Top-1 Solution of Multi-Moments in Time Challenge 2019 [56.15819266653481]
We conduct several experiments with the popular image-based action recognition methods TRN, TSN, and TSM.
A novel temporal interlacing network is proposed towards fast and accurate recognition.
We ensemble all the above models and achieve 67.22% on the validation set and 60.77% on the test set, which ranks 1st on the final leaderboard.
arXiv Detail & Related papers (2020-03-12T15:11:38Z)