ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries
Challenge 2022
- URL: http://arxiv.org/abs/2207.00383v1
- Date: Fri, 1 Jul 2022 12:48:35 GMT
- Title: ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries
Challenge 2022
- Authors: Naiyuan Liu, Xiaohan Wang, Xiaobo Li, Yi Yang, Yueting Zhuang
- Abstract summary: Given a video clip and a text query, the goal of this challenge is to locate a temporal moment of the video clip where the answer to the query can be obtained.
We propose a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips.
The experimental results demonstrate the effectiveness of our method.
- Score: 61.81899056005645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we present the ReLER@ZJU-Alibaba submission to the Ego4D
Natural Language Queries (NLQ) Challenge in CVPR 2022. Given a video clip and a
text query, the goal of this challenge is to locate a temporal moment of the
video clip where the answer to the query can be obtained. To tackle this task,
we propose a multi-scale cross-modal transformer and a video frame-level
contrastive loss to fully uncover the correlation between language queries and
video clips. Besides, we propose two data augmentation strategies to increase
the diversity of training samples. The experimental results demonstrate the
effectiveness of our method. The final submission ranked first on the
leaderboard.
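The report itself contains no code; as a rough, hypothetical sketch of what a video frame-level contrastive loss can look like, the snippet below contrasts per-frame video features against a pooled query embedding and treats frames inside the annotated moment as positives. The function name, tensor shapes, temperature value, and the exact formulation are assumptions for illustration, not the authors' implementation.
```python
# Hypothetical PyTorch sketch of a frame-level video-text contrastive loss.
# Names, shapes, and the formulation are illustrative assumptions only.
import torch
import torch.nn.functional as F

def frame_level_contrastive_loss(frame_feats, query_feat, inside_moment, temperature=0.07):
    """
    frame_feats:   (T, D) per-frame video features for one clip
    query_feat:    (D,)   pooled embedding of the language query
    inside_moment: (T,)   bool mask, True for frames inside the ground-truth moment
    """
    # Cosine similarity between every frame and the query.
    frame_feats = F.normalize(frame_feats, dim=-1)
    query_feat = F.normalize(query_feat, dim=-1)
    sim = (frame_feats @ query_feat) / temperature  # (T,)

    # Softmax over all frames of the clip; frames inside the annotated
    # moment act as positives, the remaining frames as negatives.
    log_prob = F.log_softmax(sim, dim=0)
    pos = inside_moment.float()
    return -(log_prob * pos).sum() / pos.sum().clamp(min=1.0)

# Toy usage with random features: 16 frames, 256-dim embeddings.
T, D = 16, 256
loss = frame_level_contrastive_loss(
    torch.randn(T, D),
    torch.randn(D),
    torch.tensor([False] * 6 + [True] * 4 + [False] * 6),
)
```
Under this sketch, the loss pulls the query embedding toward frames inside the target moment and pushes it away from the remaining frames of the same clip, which is one plausible reading of "frame-level" contrast.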
Related papers
- First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge [4.075139470537149]
We present our first-place solution to the Multiple-choice Video Question Answering track of The Second Perception Test Challenge.
This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content.
arXiv Detail & Related papers (2024-09-20T14:31:13Z)
- GroundNLQ @ Ego4D Natural Language Queries Challenge 2023 [73.12670280220992]
To accurately ground queries in a video, an effective egocentric feature extractor and a powerful grounding model are required.
We leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations.
In addition, we introduce a novel grounding model GroundNLQ, which employs a multi-modal multi-scale grounding module.
arXiv Detail & Related papers (2023-06-27T07:27:52Z)
- Action Sensitivity Learning for the Ego4D Episodic Memory Challenge 2023 [41.10032280192564]
This report presents ReLER submission to two tracks in the Ego4D Episodic Memory Benchmark in CVPR 2023.
This solution builds on our proposed Action Sensitivity Learning (ASL) framework to better capture the discrepant information across frames.
arXiv Detail & Related papers (2023-06-15T14:50:17Z)
- NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory [92.98552727430483]
Narrations-as-Queries (NaQ) is a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model.
NaQ improves multiple top models by substantial margins (even doubling their accuracy).
We also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
arXiv Detail & Related papers (2023-01-02T16:40:15Z)
- The Runner-up Solution for YouTube-VIS Long Video Challenge 2022 [72.13080661144761]
We adopt the previously proposed online video instance segmentation method IDOL for this challenge.
We use pseudo labels to further aid contrastive learning and obtain more temporally consistent instance embeddings.
The proposed method obtained 40.2 AP on the YouTube-VIS 2022 long video dataset and ranked second in this challenge.
arXiv Detail & Related papers (2022-11-18T01:40:59Z)
- ReLER@ZJU Submission to the Ego4D Moment Queries Challenge 2022 [42.02602065259257]
We present the ReLER@ZJU submission to the Ego4D Moment Queries Challenge in ECCV 2022.
The goal is to retrieve and localize all instances of possible activities in egocentric videos.
The final submission achieved a Recall@1 (tIoU=0.5) score of 37.24 and an average mAP of 17.67, taking third place on the leaderboard.
arXiv Detail & Related papers (2022-11-17T14:28:31Z)
- Team PKU-WICT-MIPL PIC Makeup Temporal Video Grounding Challenge 2022 Technical Report [42.49264486550348]
We propose a phrase relationship mining framework to exploit the temporal localization relationship between fine-grained phrases and the whole sentence.
In addition, we constrain the localization results of different step-sentence queries so that they do not overlap with each other.
Our final submission ranked 2nd on the leaderboard, with only a 0.55% gap from the first.
arXiv Detail & Related papers (2022-07-06T13:50:34Z)
- AIM 2020 Challenge on Video Temporal Super-Resolution [118.46127362093135]
This paper reports the second AIM challenge on Video Temporal Super-Resolution (VTSR).
arXiv Detail & Related papers (2020-09-28T00:10:29Z)
- AIM 2019 Challenge on Video Temporal Super-Resolution: Methods and Results [129.15554076593762]
This paper reviews the first AIM challenge on video temporal super-resolution (frame interpolation).
From low-frame-rate (15 fps) video sequences, challenge participants are asked to submit higher-frame-rate (60 fps) video sequences.
We employ the REDS VTSR dataset, derived from diverse videos captured with hand-held cameras, for training and evaluation.
arXiv Detail & Related papers (2020-05-04T01:51:23Z)