Exploring Anchor-based Detection for Ego4D Natural Language Query
- URL: http://arxiv.org/abs/2208.05375v1
- Date: Wed, 10 Aug 2022 14:43:37 GMT
- Title: Exploring Anchor-based Detection for Ego4D Natural Language Query
- Authors: Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu
- Abstract summary: This paper presents the technical report for the Ego4D Natural Language Query challenge at CVPR 2022.
We propose our solution to this challenge to address these issues.
- Score: 74.87656676444163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we provide the technical report for the Ego4D Natural
Language Query challenge at CVPR 2022. The natural language query task is
challenging because it requires a comprehensive understanding of video content.
Most previous works address this task on third-person-view datasets, while
little research interest has been placed on the egocentric view so far. Although
great progress has been made, we notice that previous works cannot adapt well to
egocentric-view datasets such as Ego4D, mainly for two reasons: 1) most queries
in Ego4D have an excessively short temporal duration (e.g., less than 5
seconds); 2) queries in Ego4D require a much more complex understanding of
long-term temporal order in the video. Considering these, we propose our
solution to this challenge to address the above issues.
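The abstract does not detail the anchor design, but as a rough illustration of what anchor-based temporal detection typically involves for such short query moments, the Python sketch below (with assumed anchor scales, stride, and IoU threshold) places multi-scale temporal anchors over a clip and marks those that overlap a hypothetical ground-truth moment as positives.

```python
# Minimal sketch of anchor-based temporal grounding, as implied by the title.
# The scales, stride, and IoU threshold below are illustrative assumptions,
# not the paper's actual configuration.
import numpy as np

def generate_anchors(clip_len_s, scales=(1, 2, 4, 8), stride_s=0.5):
    """Place multi-scale temporal anchors (start, end) over a clip.
    Short scales are included because most Ego4D NLQ queries last < 5 s."""
    anchors = []
    for scale in scales:
        t = 0.0
        while t + scale <= clip_len_s:
            anchors.append((t, t + scale))
            t += stride_s
    return np.array(anchors)  # shape (N, 2)

def temporal_iou(anchors, gt):
    """IoU between each anchor and a ground-truth (start, end) moment."""
    inter = np.clip(np.minimum(anchors[:, 1], gt[1]) - np.maximum(anchors[:, 0], gt[0]), 0, None)
    union = (anchors[:, 1] - anchors[:, 0]) + (gt[1] - gt[0]) - inter
    return inter / union

if __name__ == "__main__":
    anchors = generate_anchors(clip_len_s=480.0)   # an 8-minute clip
    gt_moment = np.array([37.2, 41.0])             # hypothetical 3.8 s answer span
    ious = temporal_iou(anchors, gt_moment)
    positives = anchors[ious >= 0.5]               # anchors treated as positive matches
    print(f"{len(anchors)} anchors, {len(positives)} positives for the query moment")
```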
Related papers
- EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation [54.32133648259802]
We present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo.
This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions.
arXiv Detail & Related papers (2024-06-26T05:01:37Z)
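The EgoVideo entry above builds on a video-language two-tower model; as a generic illustration of that design (not the actual EgoVideo architecture, and with assumed feature dimensions and pooling), the sketch below encodes video and text with separate towers and compares them by cosine similarity.

```python
# Generic dual-encoder ("two-tower") video-language sketch. Tower designs,
# feature sizes, and mean-pooling are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, video_dim=768, text_dim=512, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Sequential(
            nn.Linear(video_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, video_dim) frame features; mean-pool over time.
        v = F.normalize(self.video_proj(video_feats.mean(dim=1)), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v @ t.T  # (B, B) similarity matrix, usable for contrastive training

model = TwoTower()
sim = model(torch.randn(4, 16, 768), torch.randn(4, 512))
print(sim.shape)  # torch.Size([4, 4])
```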
- Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z)
- EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding [53.275916136138996]
EgoSchema is a very long-form video question-answering dataset, spanning over 250 hours of real video data.
For each question, EgoSchema requires the correct answer to be selected from five given options based on a three-minute-long video clip.
We find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x longer than any other video understanding dataset.
arXiv Detail & Related papers (2023-08-17T17:59:59Z)
- Egocentric Video Task Translation @ Ego4D Challenge 2022 [109.30649877677257]
The EgoTask Translation approach explores relations among a set of egocentric video tasks in the Ego4D challenge.
We propose to leverage existing models developed for other related tasks and design a task translator that learns to 'translate' auxiliary task features to the primary task.
Our proposed approach achieves competitive performance on two Ego4D challenges, ranking 1st in the Talking to Me challenge and 3rd in the PNR localization challenge.
arXiv Detail & Related papers (2023-02-03T18:05:49Z)
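The EgoTask Translation entry above hinges on 'translating' auxiliary task features into the primary task; the hypothetical sketch below shows one way such a translator could look, with per-task linear translators and a gated fusion that are illustrative assumptions rather than the paper's actual design.

```python
# Hypothetical sketch of "translating" auxiliary task features into a primary
# task's feature space and fusing them. Dimensions, the number of auxiliary
# tasks, and the fusion rule are assumptions, not the published design.
import torch
import torch.nn as nn

class TaskTranslator(nn.Module):
    def __init__(self, aux_dims=(256, 512), primary_dim=384):
        super().__init__()
        # One small translator per auxiliary task, mapping into the primary space.
        self.translators = nn.ModuleList(nn.Linear(d, primary_dim) for d in aux_dims)
        self.gate = nn.Linear(primary_dim, 1)  # weight how much each translated feature matters

    def forward(self, primary_feat, aux_feats):
        # primary_feat: (B, primary_dim); aux_feats: list of (B, aux_dim_i) tensors.
        translated = [t(f) for t, f in zip(self.translators, aux_feats)]
        stacked = torch.stack(translated, dim=1)            # (B, num_aux, primary_dim)
        weights = torch.softmax(self.gate(stacked), dim=1)   # (B, num_aux, 1)
        return primary_feat + (weights * stacked).sum(dim=1) # fused primary feature

model = TaskTranslator()
fused = model(torch.randn(2, 384), [torch.randn(2, 256), torch.randn(2, 512)])
print(fused.shape)  # torch.Size([2, 384])
```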
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show that they fall significantly short of humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- Egocentric Video-Language Pretraining [74.04740069230692]
Video-language pretraining aims to learn transferable representations to advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.