GroundNLQ @ Ego4D Natural Language Queries Challenge 2023
- URL: http://arxiv.org/abs/2306.15255v1
- Date: Tue, 27 Jun 2023 07:27:52 GMT
- Title: GroundNLQ @ Ego4D Natural Language Queries Challenge 2023
- Authors: Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li,
Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou
- Abstract summary: To accurately ground a natural language query in a video, an effective egocentric feature extractor and a powerful grounding model are required.
We leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations.
In addition, we introduce a novel grounding model GroundNLQ, which employs a multi-modal multi-scale grounding module.
- Score: 73.12670280220992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we present our champion solution for the Ego4D
Natural Language Queries (NLQ) Challenge at CVPR 2023. Essentially, accurately
grounding a natural language query in a video requires an effective egocentric
feature extractor and a powerful grounding model. Motivated by this, we
leverage a two-stage pre-training strategy to train egocentric feature
extractors and the grounding model on video narrations, and further fine-tune
the model on annotated data. In addition, we introduce a novel grounding model,
GroundNLQ, which employs a multi-modal multi-scale grounding module for
effective video and text fusion across various temporal intervals, especially
for long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for
R1@IoU=0.3 and R1@IoU=0.5, respectively, and surpasses all other teams by a
noticeable margin. Our code will be released at
https://github.com/houzhijian/GroundNLQ.
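For context on the reported numbers, the sketch below shows how Recall@1 at a temporal IoU threshold (the R1@IoU metric quoted above) is commonly computed for NLQ-style grounding: a query counts as a hit if its top-1 predicted window overlaps the annotated window with IoU at or above the threshold. This is an illustrative sketch with made-up helper names, not the official Ego4D evaluation script.

# Illustrative R1@IoU computation for temporal grounding
# (not the official Ego4D NLQ evaluation code).

def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_threshold):
    """top1_preds, gts: lists of (start, end) pairs, one per query."""
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)

# Example: this prediction is a hit at IoU>=0.3 but a miss at IoU>=0.5.
preds = [(10.0, 20.0)]
gts = [(14.0, 30.0)]
print(recall_at_1(preds, gts, 0.3))  # 100.0
print(recall_at_1(preds, gts, 0.5))  # 0.0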
Related papers
- EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation [54.32133648259802]
We present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo.
This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions.
arXiv Detail & Related papers (2024-06-26T05:01:37Z)
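Since the EgoVideo entry above is built on a video-language two-tower design, here is a minimal sketch of that general pattern: separate video and text encoders projected into a shared embedding space and trained with a symmetric contrastive (InfoNCE-style) loss. The module shapes and names are placeholders for illustration, not details taken from the EgoVideo paper.

# Minimal two-tower video-language sketch (illustrative, not EgoVideo):
# separate encoders map clips and captions into one space and are trained
# with a symmetric contrastive loss over in-batch pairs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, video_dim=768, text_dim=512, embed_dim=256):
        super().__init__()
        # Placeholder "towers": real systems use video/text transformers here.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * v @ t.T
        targets = torch.arange(len(v))
        # Symmetric InfoNCE: match each clip to its caption and vice versa.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random pooled features for a batch of 4 clip/caption pairs.
model = TwoTower()
loss = model(torch.randn(4, 768), torch.randn(4, 512))
print(loss.item())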
- ObjectNLQ @ Ego4D Episodic Memory Challenge 2024 [51.57555556405898]
We present our approach for the Natural Language Query track and Goal Step track of the Ego4D Episodic Memory Benchmark at CVPR 2024.
Both challenges require the localization of actions within long video sequences using textual queries.
We introduce a novel approach, termed ObjectNLQ, which incorporates an object branch to augment the video representation with detailed object information.
arXiv Detail & Related papers (2024-06-22T07:57:58Z)
- Localizing Moments in Long Video Via Multimodal Guidance [51.72829274071017]
We propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows.
Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ)
arXiv Detail & Related papers (2023-02-26T18:19:24Z)
- NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory [92.98552727430483]
Narrations-as-Queries (NaQ) is a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model.
NaQ improves multiple top models by substantial margins (even doubling their accuracy)
We also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
arXiv Detail & Related papers (2023-01-02T16:40:15Z)
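To make the narrations-as-queries idea in the NaQ entry above concrete, here is a minimal sketch of converting timestamped narrations into NLQ-style (query, temporal window) training samples. The fixed half-width window and the narrator-tag cleanup are simplifying assumptions for illustration; the actual NaQ pipeline derives its windows from the narration stream rather than using a constant.

# Illustrative narrations-to-queries conversion (not the official NaQ code).

from dataclasses import dataclass

@dataclass
class PseudoNLQSample:
    video_id: str
    query: str          # narration text reused as a pseudo language query
    start_sec: float    # pseudo ground-truth window start
    end_sec: float      # pseudo ground-truth window end

def narrations_to_queries(video_id, narrations, duration_sec, half_width=2.0):
    """narrations: list of (timestamp_sec, text) pairs for one video."""
    samples = []
    for t, text in narrations:
        query = text.replace("#C C", "").strip()  # drop the narrator tag
        start = max(0.0, t - half_width)
        end = min(duration_sec, t + half_width)
        samples.append(PseudoNLQSample(video_id, query, start, end))
    return samples

# Toy usage on two narrations of a 2-minute video.
samples = narrations_to_queries(
    "video_001",
    [(12.3, "#C C opens the fridge"), (47.8, "#C C picks up a knife")],
    duration_sec=120.0,
)
print(samples[0])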
- A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge [8.674624972031387]
This report describes our submission to the Ego4D Natural Language Queries (NLQ) Challenge.
Our solution inherits the point-based event representation from our prior work on temporal action localization, and develops a Transformer-based model for video grounding.
Without bells and whistles, our submission based on a single model achieves 12.64% Mean R@1 and is ranked 2nd on the public leaderboard.
arXiv Detail & Related papers (2022-11-16T06:33:37Z)
- Egocentric Video-Language Pretraining @ Ego4D Challenge 2022 [74.04740069230692]
We propose a video-language pretraining solution, EgoVLP, for four Ego4D challenge tasks.
Based on three key designs described in the paper, we develop a pretrained video-language model that can transfer its egocentric video-text representation to several downstream video tasks.
arXiv Detail & Related papers (2022-07-04T12:47:16Z)
- Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022 [22.299810960572348]
We propose a video-language pretraining solution, EgoVLP, for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge.
Our best single model obtains strong performance on the challenge test set with 47.39% mAP and 61.44% nDCG.
arXiv Detail & Related papers (2022-07-04T11:32:48Z)
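As a pointer to what the mAP and nDCG figures in the entry above measure, the sketch below computes nDCG for a single retrieval ranking with graded relevance scores. It is the generic textbook formulation, not the EPIC-KITCHENS-100 MIR evaluation code, which defines its own relevance grades between videos and captions.

# Generic nDCG sketch for one ranked retrieval list (illustrative only).

import math

def dcg(relevances):
    """relevances: graded relevance scores in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Example: a highly relevant item ranked second lowers nDCG below 1.0.
print(round(ndcg([0.2, 1.0, 0.0, 0.5]), 4))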
This list is automatically generated from the titles and abstracts of the papers on this site.