Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge
- URL: http://arxiv.org/abs/2601.10228v1
- Date: Thu, 15 Jan 2026 09:43:49 GMT
- Title: Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge
- Authors: Sicheng Yang, Yukai Huang, Shitong Sun, Weitong Cai, Jiankang Deng, Jifei Song, Zhensong Zhang,
- Abstract summary: We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, and novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.
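The released repository (linked above) is the authoritative implementation; as a rough illustration only, a T-CoT-style wrapper might decompose a long video into ordered segments, collect per-segment observations, and aggregate them into a single multiple-choice answer. Everything below, including the `query_mllm` placeholder and the prompt wording, is assumed rather than taken from the paper.

```python
# Hypothetical sketch of Temporal Chain-of-Thought (T-CoT) prompting:
# gather time-ordered per-segment observations, then reason over the chain.
# `query_mllm` stands in for any MLLM inference call (e.g. Qwen2.5-VL).
from typing import Callable, List

def t_cot_answer(
    question: str,
    choices: List[str],
    segment_paths: List[str],              # video split into ordered clips
    query_mllm: Callable[[str, str], str]  # (clip_path, prompt) -> text
) -> str:
    # Step 1: one observation per segment, kept in temporal order.
    observations = []
    for i, clip in enumerate(segment_paths):
        prompt = (
            f"Segment {i + 1} of {len(segment_paths)}. "
            f"Describe only what is relevant to: {question}"
        )
        observations.append(f"[t{i + 1}] {query_mllm(clip, prompt)}")

    # Step 2: multi-step reasoning over the ordered observations as text.
    chain = "\n".join(observations)
    final_prompt = (
        f"Observations in temporal order:\n{chain}\n\n"
        f"Question: {question}\n"
        f"Choices: {', '.join(choices)}\n"
        "Think step by step over the timeline, then answer with one choice."
    )
    # The last segment is passed again so the model retains visual context.
    return query_mllm(segment_paths[-1], final_prompt)
```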
Related papers
- Vidi2: Large Multimodal Models for Video Understanding and Creation [39.82972197371385]
Vidi2 performs video understanding with fine-grained spatio-temporal grounding (STG) and advances its capability to video question answering (Video QA). Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios.
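The abstract implies outputs that pair time ranges with per-timestamp boxes; a plausible container for such spatio-temporal grounding results is sketched below. The structure and field names are assumptions for illustration, not the Vidi2 API.

```python
# Assumed container for STG output: a text query maps to a time range,
# and the range carries per-timestamp bounding boxes of the target object.
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class STGResult:
    query: str
    time_range: Tuple[float, float]  # (start_s, end_s)
    boxes: Dict[float, Tuple[int, int, int, int]] = field(default_factory=dict)
    # timestamp (s) -> (x1, y1, x2, y2) in pixels

result = STGResult(
    query="the red mug",
    time_range=(12.0, 15.5),
    boxes={12.0: (340, 220, 410, 300), 13.0: (352, 225, 420, 308)},
)
```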
arXiv Detail & Related papers (2025-11-24T07:58:29Z)
- Kwai Keye-VL 1.5 Technical Report [91.07838286692815]
We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity. Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment.
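As a rough illustration of the Slow-Fast idea (the report's actual encoder is considerably more involved), one might budget visual tokens by inter-frame similarity, giving changing frames a dense "slow" budget and near-duplicates a sparse "fast" one. The thresholds and token counts below are invented for the sketch.

```python
# Illustrative token budgeting by inter-frame similarity; not Keye-VL code.
import numpy as np

def allocate_tokens(frames: np.ndarray,          # (T, H, W, C) uint8
                    slow_tokens: int = 256,
                    fast_tokens: int = 32,
                    sim_threshold: float = 0.95) -> list:
    flat = frames.reshape(len(frames), -1).astype(np.float32)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8
    budgets = [slow_tokens]                      # first frame is always "slow"
    for t in range(1, len(frames)):
        cos_sim = float(flat[t] @ flat[t - 1])   # cosine similarity to prev frame
        budgets.append(fast_tokens if cos_sim >= sim_threshold else slow_tokens)
    return budgets

video = np.random.randint(0, 255, (8, 64, 64, 3), dtype=np.uint8)
print(allocate_tokens(video))  # e.g. [256, 32, 256, ...]
```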
arXiv Detail & Related papers (2025-09-01T15:46:58Z)
- Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding [33.58579390725519]
Video-MTR is a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns. To supervise the intermediate reasoning process, we introduce a novel gated bi-level reward system.
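A schematic of the described select-then-reason loop is given below; `select_segment` and `reason` stand in for the learned policy and MLLM calls, and nothing here reflects the released Video-MTR code or its reward system.

```python
# Schematic multi-turn loop: alternate between picking a key segment and
# reasoning over all evidence gathered so far; stop once the model commits.
from typing import Callable, List, Optional, Tuple

def multi_turn_answer(
    question: str,
    segments: List[str],
    select_segment: Callable[[str, List[str], List[str]], str],
    reason: Callable[[str, List[str]], Tuple[Optional[str], str]],
    max_turns: int = 4,
) -> str:
    evidence: List[str] = []
    answer = None
    for _ in range(max_turns):
        clip = select_segment(question, segments, evidence)  # pick next segment
        evidence.append(clip)
        answer, thought = reason(question, evidence)  # None means "keep looking"
        if answer is not None:        # model is confident enough to commit
            break
    return answer or "uncertain"
```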
arXiv Detail & Related papers (2025-08-28T06:55:08Z)
- Advancing Egocentric Video Question Answering with Multimodal Large Language Models [10.111636068164504]
Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2.
arXiv Detail & Related papers (2025-04-06T16:58:23Z)
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
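A hedged sketch of the described two-stage inference follows: a lightweight Frame Selector scores frames for relevance and a reasoning LVLM answers over the surviving core frames. Both callables are placeholders, not the paper's models.

```python
# Sketch of Frame Selector -> reasoning LVLM collaboration; illustrative only.
from typing import Callable, List

def hybrid_answer(
    question: str,
    frames: List[str],
    score_frame: Callable[[str, str], float],   # (frame, question) -> relevance
    reasoner: Callable[[List[str], str], str],  # (core frames, question) -> answer
    k: int = 8,
) -> str:
    ranked = sorted(frames, key=lambda f: score_frame(f, question), reverse=True)
    core = sorted(ranked[:k], key=frames.index)  # restore temporal order
    return reasoner(core, question)
```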
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
- LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models [53.64461404882853]
Video quality assessment (VQA) algorithms are needed to monitor and optimize the quality of streaming videos.
Here, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel visual modeling strategy for quality-aware feature extraction.
arXiv Detail & Related papers (2024-08-26T04:29:52Z)
- A Simple LLM Framework for Long-Range Video Question-Answering [63.50439701867275]
We present LLoVi, a language-based framework for long-range video question-answering (LVQA).
Our approach uses a frame/clip-level visual captioner coupled with a Large Language Model (GPT-3.5, GPT-4).
Our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain).
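Because LLoVi is deliberately simple, its two-stage recipe is easy to sketch: caption short clips, then let a text-only LLM answer over the concatenated captions. The function names below are placeholders, not the released API.

```python
# LLoVi-style pipeline in miniature. `caption_clip` and `llm` are
# placeholders for a visual captioner and an LLM call (the paper used
# GPT-3.5 / GPT-4 as the language model).
from typing import Callable, List

def llovi_answer(
    question: str,
    clips: List[str],
    caption_clip: Callable[[str], str],
    llm: Callable[[str], str],
) -> str:
    captions = [f"[{i}] {caption_clip(c)}" for i, c in enumerate(clips)]
    prompt = (
        "Video described as time-ordered clip captions:\n"
        + "\n".join(captions)
        + f"\n\nQuestion: {question}\nAnswer concisely."
    )
    return llm(prompt)
```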
arXiv Detail & Related papers (2023-12-28T18:58:01Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
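An assumed, highly simplified view of cascading segment and region selection before attention is shown below; the tensor shapes, scoring rule, and top-k values are invented for illustration and are not MIST's actual modules.

```python
# Toy cascaded selection: keep the most question-relevant temporal segments,
# then the most relevant spatial regions within them, so dense
# spatio-temporal self-attention never runs over everything at once.
import torch

def cascaded_select(features: torch.Tensor,   # (segments, regions, dim)
                    question: torch.Tensor,   # (dim,)
                    top_seg: int = 4,
                    top_reg: int = 16) -> torch.Tensor:
    seg_scores = features.mean(dim=1) @ question             # (segments,)
    seg_idx = seg_scores.topk(min(top_seg, len(seg_scores))).indices
    kept = features[seg_idx]                                 # (top_seg, regions, dim)
    reg_scores = kept @ question                             # (top_seg, regions)
    reg_idx = reg_scores.topk(min(top_reg, reg_scores.shape[1]), dim=1).indices
    return torch.gather(
        kept, 1, reg_idx.unsqueeze(-1).expand(-1, -1, kept.shape[-1])
    )

feats = torch.randn(32, 49, 256)
q = torch.randn(256)
print(cascaded_select(feats, q).shape)  # torch.Size([4, 16, 256])
```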
arXiv Detail & Related papers (2022-12-19T15:05:40Z)