HCQA @ Ego4D EgoSchema Challenge 2024
- URL: http://arxiv.org/abs/2406.15771v2
- Date: Tue, 29 Oct 2024 02:38:27 GMT
- Title: HCQA @ Ego4D EgoSchema Challenge 2024
- Authors: Haoyu Zhang, Yuquan Xie, Yisen Feng, Zaijing Li, Meng Liu, Liqiang Nie
- Abstract summary: We propose a novel scheme for egocentric video Question Answering, named HCQA.
It consists of three stages: Fine-grained Caption Generation, Context-driven Summarization, and Inference-guided Answering.
On a blind test set, HCQA achieves 75% accuracy in answering over 5,000 human-curated multiple-choice questions.
- Score: 51.57555556405898
- License:
- Abstract: In this report, we present our champion solution for the Ego4D EgoSchema Challenge at CVPR 2024. To deeply integrate a powerful egocentric captioning model and a question reasoning model, we propose a novel Hierarchical Comprehension scheme for egocentric video Question Answering, named HCQA. It consists of three stages: Fine-grained Caption Generation, Context-driven Summarization, and Inference-guided Answering. Given a long-form video, HCQA captures local detailed visual information and global summarised visual information via Fine-grained Caption Generation and Context-driven Summarization, respectively. Then, in Inference-guided Answering, HCQA uses this hierarchical information to reason about and answer the given question. On the EgoSchema blind test set, HCQA achieves 75% accuracy in answering over 5,000 human-curated multiple-choice questions. Our code will be released at https://github.com/Hyu-Zhang/HCQA.
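To make the three-stage hierarchy concrete, the following is a minimal Python sketch of the pipeline as the abstract describes it. The helper functions (`caption_clip`, `summarize_captions`, `ask_llm`) are hypothetical stand-ins, not the authors' released code; the official implementation is at the repository linked above.

```python
# Illustrative sketch of HCQA's three-stage hierarchy (not the authors' code;
# see https://github.com/Hyu-Zhang/HCQA for the official implementation).
# caption_clip(), summarize_captions() and ask_llm() are hypothetical stubs.

from typing import List


def caption_clip(clip) -> str:
    """Stage 1 -- Fine-grained Caption Generation for one short clip."""
    raise NotImplementedError("plug in an egocentric captioning model")


def summarize_captions(captions: List[str]) -> str:
    """Stage 2 -- Context-driven Summarization over all clip captions."""
    raise NotImplementedError("plug in an LLM that merges local captions")


def ask_llm(prompt: str) -> str:
    """Backend for Stage 3 -- a reasoning LLM call."""
    raise NotImplementedError("plug in any capable reasoning LLM")


def hcqa_answer(clips: list, question: str, options: List[str]) -> str:
    """Answer one multiple-choice question about a long-form egocentric video."""
    # Local, detailed visual information (Stage 1).
    captions = [caption_clip(clip) for clip in clips]
    # Global, summarised visual information (Stage 2).
    summary = summarize_captions(captions)
    # Inference-guided Answering over both levels of the hierarchy (Stage 3).
    prompt = (
        "Clip captions:\n" + "\n".join(captions)
        + "\n\nVideo summary:\n" + summary
        + "\n\nQuestion: " + question
        + "\nOptions: " + "; ".join(options)
        + "\nReason over the captions and summary, then answer with one option."
    )
    return ask_llm(prompt)
```

On EgoSchema, each question comes with five candidate options and a three-minute clip, so the reported accuracy is the fraction of the roughly 5,000 blind test questions for which the returned option matches the hidden ground truth.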
Related papers
- Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long, egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z) - Improving Zero-shot Visual Question Answering via Large Language Models
with Reasoning Question Prompts [22.669502403623166]
We present Reasoning Question Prompts for VQA tasks, which can further activate the potential of Large Language Models.
We generate self-contained questions as reasoning question prompts via an unsupervised question edition module.
Each reasoning question prompt clearly indicates the intent of the original question.
Then, the candidate answers, along with their confidence scores, are fed into the LLMs.
arXiv Detail & Related papers (2023-11-15T15:40:46Z) - EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language
Understanding [53.275916136138996]
EgoSchema is a very long-form video question-answering dataset, spanning over 250 hours of real video data.
For each question, EgoSchema requires the correct answer to be selected from five given options based on a three-minute-long video clip.
We find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x longer than any other video understanding dataset.
arXiv Detail & Related papers (2023-08-17T17:59:59Z) - NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for
Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z) - NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory [92.98552727430483]
Narrations-as-Queries (NaQ) is a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model.
NaQ improves multiple top models by substantial margins (even doubling their accuracy).
We also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
arXiv Detail & Related papers (2023-01-02T16:40:15Z) - Exploring Anchor-based Detection for Ego4D Natural Language Query [74.87656676444163]
This paper presents the technical report for the Ego4D Natural Language Query challenge in CVPR 2022.
We propose a solution that addresses the key issues of this challenge.
arXiv Detail & Related papers (2022-08-10T14:43:37Z) - Egocentric Video-Language Pretraining @ Ego4D Challenge 2022 [74.04740069230692]
We propose a video-language pretraining solution (EgoVLP) for four Ego4D challenge tasks.
Based on three key designs, we develop a pretrained video-language model that can transfer its egocentric video-text representation to several downstream video tasks.
arXiv Detail & Related papers (2022-07-04T12:47:16Z)