Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge
- URL: http://arxiv.org/abs/2601.10228v1
- Date: Thu, 15 Jan 2026 09:43:49 GMT
- Title: Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge
- Authors: Sicheng Yang, Yukai Huang, Shitong Sun, Weitong Cai, Jiankang Deng, Jifei Song, Zhensong Zhang,
- Abstract summary: We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, and novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.
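The released repository (linked above) is the authoritative implementation; as a rough illustration only, a T-CoT-style wrapper might decompose a long video into ordered segments, collect per-segment observations, and aggregate them into a single multiple-choice answer. Everything below, including the `query_mllm` placeholder and the prompt wording, is assumed rather than taken from the paper.

```python
# Hypothetical sketch of Temporal Chain-of-Thought (T-CoT) prompting:
# gather time-ordered per-segment observations, then reason over the chain.
# `query_mllm` stands in for any MLLM inference call (e.g. Qwen2.5-VL).
from typing import Callable, List

def t_cot_answer(
    question: str,
    choices: List[str],
    segment_paths: List[str],              # video split into ordered clips
    query_mllm: Callable[[str, str], str]  # (clip_path, prompt) -> text
) -> str:
    # Step 1: one observation per segment, kept in temporal order.
    observations = []
    for i, clip in enumerate(segment_paths):
        prompt = (
            f"Segment {i + 1} of {len(segment_paths)}. "
            f"Describe only what is relevant to: {question}"
        )
        observations.append(f"[t{i + 1}] {query_mllm(clip, prompt)}")

    # Step 2: multi-step reasoning over the ordered observations as text.
    chain = "\n".join(observations)
    final_prompt = (
        f"Observations in temporal order:\n{chain}\n\n"
        f"Question: {question}\n"
        f"Choices: {', '.join(choices)}\n"
        "Think step by step over the timeline, then answer with one choice."
    )
    # The last segment is passed again so the model retains visual context.
    return query_mllm(segment_paths[-1], final_prompt)
```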
Related papers
- Vidi2: Large Multimodal Models for Video Understanding and Creation [39.82972197371385]
Vidi2 performs video understanding with fine-grained spatio-temporal grounding (STG) and advances its capability to video question answering (Video QA). Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios.
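The abstract implies outputs that pair time ranges with per-timestamp boxes; a plausible container for such spatio-temporal grounding results is sketched below. The structure and field names are assumptions for illustration, not the Vidi2 API.

```python
# Assumed container for STG output: a text query maps to a time range,
# and the range carries per-timestamp bounding boxes of the target object.
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class STGResult:
    query: str
    time_range: Tuple[float, float]  # (start_s, end_s)
    boxes: Dict[float, Tuple[int, int, int, int]] = field(default_factory=dict)
    # timestamp (s) -> (x1, y1, x2, y2) in pixels

result = STGResult(
    query="the red mug",
    time_range=(12.0, 15.5),
    boxes={12.0: (340, 220, 410, 300), 13.0: (352, 225, 420, 308)},
)
```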
arXiv Detail & Related papers (2025-11-24T07:58:29Z)
- Kwai Keye-VL 1.5 Technical Report [91.07838286692815]
We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity. Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment.
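As a rough illustration of the Slow-Fast idea (the report's actual encoder is considerably more involved), one might budget visual tokens by inter-frame similarity, giving changing frames a dense "slow" budget and near-duplicates a sparse "fast" one. The thresholds and token counts below are invented for the sketch.

```python
# Illustrative token budgeting by inter-frame similarity; not Keye-VL code.
import numpy as np

def allocate_tokens(frames: np.ndarray,          # (T, H, W, C) uint8
                    slow_tokens: int = 256,
                    fast_tokens: int = 32,
                    sim_threshold: float = 0.95) -> list:
    flat = frames.reshape(len(frames), -1).astype(np.float32)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8
    budgets = [slow_tokens]                      # first frame is always "slow"
    for t in range(1, len(frames)):
        cos_sim = float(flat[t] @ flat[t - 1])   # cosine similarity to prev frame
        budgets.append(fast_tokens if cos_sim >= sim_threshold else slow_tokens)
    return budgets

video = np.random.randint(0, 255, (8, 64, 64, 3), dtype=np.uint8)
print(allocate_tokens(video))  # e.g. [256, 32, 256, ...]
```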
arXiv Detail & Related papers (2025-09-01T15:46:58Z)
- Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding [33.58579390725519]
Video-MTR is a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns. To supervise the intermediate reasoning process, we introduce a novel gated bi-level reward system.
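A schematic of the described select-then-reason loop is given below; `select_segment` and `reason` stand in for the learned policy and MLLM calls, and nothing here reflects the released Video-MTR code or its reward system.

```python
# Schematic multi-turn loop: alternate between picking a key segment and
# reasoning over all evidence gathered so far; stop once the model commits.
from typing import Callable, List, Optional, Tuple

def multi_turn_answer(
    question: str,
    segments: List[str],
    select_segment: Callable[[str, List[str], List[str]], str],
    reason: Callable[[str, List[str]], Tuple[Optional[str], str]],
    max_turns: int = 4,
) -> str:
    evidence: List[str] = []
    answer = None
    for _ in range(max_turns):
        clip = select_segment(question, segments, evidence)  # pick next segment
        evidence.append(clip)
        answer, thought = reason(question, evidence)  # None means "keep looking"
        if answer is not None:        # model is confident enough to commit
            break
    return answer or "uncertain"
```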
arXiv Detail & Related papers (2025-08-28T06:55:08Z)
- Advancing Egocentric Video Question Answering with Multimodal Large Language Models [10.111636068164504]
Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2.
arXiv Detail & Related papers (2025-04-06T16:58:23Z)
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
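A hedged sketch of the described two-stage inference follows: a lightweight Frame Selector scores frames for relevance and a reasoning LVLM answers over the surviving core frames. Both callables are placeholders, not the paper's models.

```python
# Sketch of Frame Selector -> reasoning LVLM collaboration; illustrative only.
from typing import Callable, List

def hybrid_answer(
    question: str,
    frames: List[str],
    score_frame: Callable[[str, str], float],   # (frame, question) -> relevance
    reasoner: Callable[[List[str], str], str],  # (core frames, question) -> answer
    k: int = 8,
) -> str:
    ranked = sorted(frames, key=lambda f: score_frame(f, question), reverse=True)
    core = sorted(ranked[:k], key=frames.index)  # restore temporal order
    return reasoner(core, question)
```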
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
- LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models [53.64461404882853]
Video quality assessment (VQA) algorithms are needed to monitor and optimize the quality of streaming videos.
Here, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel visual modeling strategy for quality-aware feature extraction.
arXiv Detail & Related papers (2024-08-26T04:29:52Z)
- A Simple LLM Framework for Long-Range Video Question-Answering [63.50439701867275]
We present LLoVi, a language-based framework for long-range video question-answering (LVQA).
Our approach uses a frame/clip-level visual captioner coupled with a Large Language Model (GPT-3.5, GPT-4).
Our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain).
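Because LLoVi is deliberately simple, its two-stage recipe is easy to sketch: caption short clips, then let a text-only LLM answer over the concatenated captions. The function names below are placeholders, not the released API.

```python
# LLoVi-style pipeline in miniature. `caption_clip` and `llm` are
# placeholders for a visual captioner and an LLM call (the paper used
# GPT-3.5 / GPT-4 as the language model).
from typing import Callable, List

def llovi_answer(
    question: str,
    clips: List[str],
    caption_clip: Callable[[str], str],
    llm: Callable[[str], str],
) -> str:
    captions = [f"[{i}] {caption_clip(c)}" for i, c in enumerate(clips)]
    prompt = (
        "Video described as time-ordered clip captions:\n"
        + "\n".join(captions)
        + f"\n\nQuestion: {question}\nAnswer concisely."
    )
    return llm(prompt)
```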
arXiv Detail & Related papers (2023-12-28T18:58:01Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
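An assumed, highly simplified view of cascading segment and region selection before attention is shown below; the tensor shapes, scoring rule, and top-k values are invented for illustration and are not MIST's actual modules.

```python
# Toy cascaded selection: keep the most question-relevant temporal segments,
# then the most relevant spatial regions within them, so dense
# spatio-temporal self-attention never runs over everything at once.
import torch

def cascaded_select(features: torch.Tensor,   # (segments, regions, dim)
                    question: torch.Tensor,   # (dim,)
                    top_seg: int = 4,
                    top_reg: int = 16) -> torch.Tensor:
    seg_scores = features.mean(dim=1) @ question             # (segments,)
    seg_idx = seg_scores.topk(min(top_seg, len(seg_scores))).indices
    kept = features[seg_idx]                                 # (top_seg, regions, dim)
    reg_scores = kept @ question                             # (top_seg, regions)
    reg_idx = reg_scores.topk(min(top_reg, reg_scores.shape[1]), dim=1).indices
    return torch.gather(
        kept, 1, reg_idx.unsqueeze(-1).expand(-1, -1, kept.shape[-1])
    )

feats = torch.randn(32, 49, 256)
q = torch.randn(256)
print(cascaded_select(feats, q).shape)  # torch.Size([4, 16, 256])
```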
arXiv Detail & Related papers (2022-12-19T15:05:40Z)