A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge
- URL: http://arxiv.org/abs/2211.08704v1
- Date: Wed, 16 Nov 2022 06:33:37 GMT
- Title: A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge
- Authors: Sicheng Mo, Fangzhou Mu, Yin Li
- Abstract summary: This report describes our submission to the Ego4D Natural Language Queries (NLQ) Challenge.
Our solution inherits the point-based event representation from our prior work on temporal action localization, and develops a Transformer-based model for video grounding.
Without bells and whistles, our submission based on a single model achieves 12.64% Mean R@1 and is ranked 2nd on the public leaderboard.
- Score: 8.674624972031387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report describes Badgers@UW-Madison, our submission to the Ego4D Natural
Language Queries (NLQ) Challenge. Our solution inherits the point-based event
representation from our prior work on temporal action localization, and
develops a Transformer-based model for video grounding. Further, our solution
integrates several strong video features including SlowFast, Omnivore and
EgoVLP. Without bells and whistles, our submission based on a single model
achieves 12.64% Mean R@1 and is ranked 2nd on the public leaderboard.
Meanwhile, our method garners 28.45% (18.03%) R@5 at tIoU=0.3 (0.5), surpassing
the top-ranked solution by up to 5.5 absolute percentage points.
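For context, below is a minimal sketch of how Recall@k at a tIoU threshold is typically computed for this kind of moment retrieval task. The function names, the toy data, and the assumption that "Mean R@1" averages R@1 over the tIoU thresholds 0.3 and 0.5 are illustrative only and are not taken from the authors' evaluation code.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds

def tiou(pred: Interval, gt: Interval) -> float:
    """Temporal IoU between a predicted and a ground-truth moment."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(preds: List[List[Interval]], gts: List[Interval],
                k: int, thresh: float) -> float:
    """Percentage of queries whose top-k predicted moments contain at least
    one moment with tIoU >= thresh against the ground-truth moment."""
    hits = sum(
        any(tiou(p, gt) >= thresh for p in query_preds[:k])
        for query_preds, gt in zip(preds, gts)
    )
    return 100.0 * hits / len(gts)

# Toy example: two queries, each with ranked moment predictions.
preds = [[(2.0, 6.0), (10.0, 12.0)], [(0.0, 3.0)]]
gts = [(3.0, 7.0), (5.0, 8.0)]
mean_r1 = sum(recall_at_k(preds, gts, k=1, thresh=t) for t in (0.3, 0.5)) / 2
print(mean_r1)  # 50.0 on this toy data
```

The R@5 numbers quoted above would correspond to recall_at_k with k=5 at thresh=0.3 and 0.5, respectively.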
Related papers
- PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z)
- GroundNLQ @ Ego4D Natural Language Queries Challenge 2023 [73.12670280220992]
Accurately grounding a query in a video requires an effective egocentric feature extractor and a powerful grounding model.
We leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations.
In addition, we introduce a novel grounding model GroundNLQ, which employs a multi-modal multi-scale grounding module.
arXiv Detail & Related papers (2023-06-27T07:27:52Z)
- InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges [66.62885923201543]
We present our champion solutions to five tracks of the Ego4D challenge.
We leverage InternVideo, our video foundation model, for these five Ego4D tasks.
InternVideo-Ego4D is an effective paradigm for adapting a strong foundation model to downstream egocentric video understanding tasks.
arXiv Detail & Related papers (2022-11-17T13:45:06Z)
- Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge [7.718326034763966]
Our submission builds on ActionFormer, the state-of-the-art backbone for temporal action localization, and a trio of strong video features from SlowFast, Omnivore and EgoVLP.
Our solution is ranked 2nd on the public leaderboard with 21.76% average mAP on the test set, which is nearly three times higher than the official baseline.
arXiv Detail & Related papers (2022-11-16T17:43:26Z)
- Egocentric Video-Language Pretraining @ Ego4D Challenge 2022 [74.04740069230692]
We propose a video-language pretraining solution, EgoVLP, for four Ego4D challenge tasks.
Building on three core designs, we develop a pretrained video-language model that transfers its egocentric video-text representation to several downstream video tasks.
arXiv Detail & Related papers (2022-07-04T12:47:16Z)
- Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022 [22.299810960572348]
We propose a video-language pretraining solution, EgoVLP, for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge.
Our best single model obtains strong performance on the challenge test set with 47.39% mAP and 61.44% nDCG.
arXiv Detail & Related papers (2022-07-04T11:32:48Z)
- ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022 [61.81899056005645]
Given a video clip and a text query, the goal of this challenge is to locate the temporal moment in the video clip where the answer to the query can be obtained.
We propose a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips.
The experimental results demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2022-07-01T12:48:35Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder of the speech recognition model with a Transformer architecture.
We also restrict the self-attention component to a segment rather than the whole sequence in order to reduce computation (a sketch of such a restricted attention mask follows this list).
arXiv Detail & Related papers (2020-02-10T16:29:26Z)
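As referenced in the last entry, restricting self-attention to a segment can be illustrated with a simple attention mask. The sketch below assumes a sliding-window restriction and uses PyTorch; the window size and masking style are assumptions for illustration, since the summary does not say whether the paper uses fixed chunks or a sliding window.

```python
import torch

def segment_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks the key positions each query position may
    attend to: only positions within `window` steps, not the whole sequence."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

# Toy usage inside standard scaled dot-product attention.
seq_len, dim, window = 8, 16, 2
q = k = v = torch.randn(seq_len, dim)
scores = q @ k.T / dim ** 0.5                      # (seq_len, seq_len) attention logits
scores = scores.masked_fill(~segment_attention_mask(seq_len, window), float("-inf"))
out = scores.softmax(dim=-1) @ v                   # each position attends to a local segment only
```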
This list is automatically generated from the titles and abstracts of the papers on this site.