HCQA-1.5 @ Ego4D EgoSchema Challenge 2025
- URL: http://arxiv.org/abs/2505.20644v1
- Date: Tue, 27 May 2025 02:45:14 GMT
- Title: HCQA-1.5 @ Ego4D EgoSchema Challenge 2025
- Authors: Haoyu Zhang, Yisen Feng, Qiaohui Chu, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
- Abstract summary: We present a method that achieves third place in the Ego4D EgoSchema Challenge at CVPR 2025. Our approach introduces a multi-source aggregation strategy to generate diverse predictions, followed by a confidence-based filtering mechanism. Our method achieves 77% accuracy on over 5,000 human-curated multiple-choice questions.
- Score: 77.414837862995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we present the method that achieved third place in the Ego4D EgoSchema Challenge at CVPR 2025. To improve the reliability of answer prediction in egocentric video question answering, we propose an effective extension to the previously proposed HCQA framework. Our approach introduces a multi-source aggregation strategy to generate diverse predictions, followed by a confidence-based filtering mechanism that selects high-confidence answers directly. For low-confidence cases, we incorporate a fine-grained reasoning module that performs additional visual and contextual analysis to refine the predictions. Evaluated on the EgoSchema blind test set, our method achieves 77% accuracy on over 5,000 human-curated multiple-choice questions, outperforming last year's winning solution and the majority of participating teams. Our code will be added at https://github.com/Hyu-Zhang/HCQA.
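The abstract outlines a three-step answer-selection flow: multi-source aggregation of diverse predictions, confidence-based filtering that accepts high-confidence answers directly, and a fine-grained reasoning step for low-confidence cases. The minimal Python sketch below illustrates that control flow only; the function names, the confidence threshold, and the majority-vote aggregation are assumptions for illustration, not details from the paper.

```python
# Illustrative sketch of the answer-selection flow described in the abstract.
# The predictor interface, threshold value, and voting scheme are hypothetical.
from collections import Counter
from typing import Callable, List, Tuple

Prediction = Tuple[str, float]  # (chosen option, confidence score)


def answer_question(
    question: str,
    options: List[str],
    predictors: List[Callable[[str, List[str]], Prediction]],
    fine_grained_reasoner: Callable[[str, List[str]], str],
    threshold: float = 0.8,  # hypothetical confidence cutoff
) -> str:
    # Multi-source aggregation: collect diverse predictions.
    predictions = [predict(question, options) for predict in predictors]

    # Confidence-based filtering: if any predictions clear the threshold,
    # accept the majority answer among them directly.
    confident = [answer for answer, conf in predictions if conf >= threshold]
    if confident:
        return Counter(confident).most_common(1)[0][0]

    # Low-confidence case: defer to additional visual/contextual analysis.
    return fine_grained_reasoner(question, options)
```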
Related papers
- Listening to the Unspoken: Exploring "365" Aspects of Multimodal Interview Performance Assessment [45.92718704785823]
We propose a novel and comprehensive framework that explores "365" aspects of interview performance.
The framework employs modality-specific feature extractors to encode heterogeneous data streams.
By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data.
arXiv Detail & Related papers (2025-07-30T13:37:06Z) - DIVE: Deep-search Iterative Video Exploration A Technical Report for the CVRR Challenge at CVPR 2025 [0.0]
In this report, we present the winning solution that achieved 1st place in the Complex Video Reasoning & Robustness Evaluation Challenge 2025.
The challenge uses the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) benchmark, which consists of 214 unique videos and 2,400 question-answer pairs spanning 11 categories.
Our method, DIVE, adopts an iterative reasoning approach, in which each input question is semantically decomposed and solved through stepwise reasoning and progressive inference.
arXiv Detail & Related papers (2025-06-27T04:05:12Z) - Correct after Answer: Enhancing Multi-Span Question Answering with Post-Processing Method [11.794628063040108]
Multi-Span Question Answering (MSQA) requires models to extract one or multiple answer spans from a given context to answer a question.
We propose Answering-Classifying-Correcting (ACC) framework, which employs a post-processing strategy to handle incorrect predictions.
arXiv Detail & Related papers (2024-10-22T08:04:32Z) - HCQA @ Ego4D EgoSchema Challenge 2024 [51.57555556405898]
We propose a novel scheme for egocentric video Question Answering, named HCQA.
It consists of three stages: Fine-grained Caption Generation, Context-driven Summarization, and Inference-guided Answering.
On a blind test set, HCQA achieves 75% accuracy in answering over 5,000 human-curated multiple-choice questions.
arXiv Detail & Related papers (2024-06-22T07:20:39Z) - Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning [67.82016092549284]
We introduce ReCoVERR, an inference-time algorithm to reduce the over-abstention of a selective vision-language system.
ReCoVERR tries to find relevant clues in an image that provide additional evidence for the prediction.
arXiv Detail & Related papers (2024-02-23T21:16:52Z) - EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding [99.904140768186]
This paper proposes a new framework as an infrastructure to advance Ego-HOI recognition by Probing, Curation and Adaption (EgoPCA)
We contribute comprehensive pre-train sets, balanced test sets and a new baseline, which are complete with a training-finetuning strategy.
We believe our data and the findings will pave a new way for Ego-HOI understanding.
arXiv Detail & Related papers (2023-09-05T17:51:16Z) - Improving Selective Visual Question Answering by Learning from Your Peers [74.20167944693424]
Visual Question Answering (VQA) models can have difficulties abstaining from answering when they are wrong.
We propose Learning from Your Peers (LYP) approach for training multimodal selection functions for making abstention decisions.
Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model.
arXiv Detail & Related papers (2023-06-14T21:22:01Z) - Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of a Large Language Model (LLM).
We propose a decoding algorithm integrating the self-evaluation guidance via beam search.
Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by 6.34%, 9.56%, and 5.46% on the GSM8K, AQuA, and StrategyQA benchmarks, respectively.
arXiv Detail & Related papers (2023-05-01T02:37:59Z) - Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022 [22.299810960572348]
We propose a video-language pretraining solution for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge.
Our best single model obtains strong performance on the challenge test set with 47.39% mAP and 61.44% nDCG.
arXiv Detail & Related papers (2022-07-04T11:32:48Z)