Advancing Egocentric Video Question Answering with Multimodal Large Language Models
- URL: http://arxiv.org/abs/2504.04550v1
- Date: Sun, 06 Apr 2025 16:58:23 GMT
- Title: Advancing Egocentric Video Question Answering with Multimodal Large Language Models
- Authors: Alkesh Patel, Vibhav Chitalia, Yinfei Yang
- Abstract summary: Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2.
- Score: 10.111636068164504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2 - a refined dataset of egocentric videos derived from QaEgo4D. Four popular MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVa-7B and Qwen2-VL-7B-Instruct) are assessed using zero-shot and fine-tuned approaches for both OpenQA and CloseQA settings. We introduce QaEgo4Dv2 to mitigate annotation noise in QaEgo4D, enabling more reliable comparison. Our results show that fine-tuned Video-LLaVa-7B and Qwen2-VL-7B-Instruct achieve new state-of-the-art performance, surpassing previous benchmarks by up to +2.6% ROUGE/METEOR (for OpenQA) and +13% accuracy (for CloseQA). We also present a thorough error analysis, indicating the models' difficulty in spatial reasoning and fine-grained object recognition - key areas for future improvement.
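As a rough illustration of the two evaluation settings named in the abstract, the sketch below scores OpenQA answers with ROUGE-L (a stand-in for the paper's full ROUGE/METEOR suite) and CloseQA answers with exact-match accuracy. The function names, the data layout, and the use of the rouge_score package are illustrative assumptions, not the paper's released scoring code.

```python
# Hedged sketch of OpenQA / CloseQA scoring, assuming predictions and
# references are already collected as parallel lists of strings.
from rouge_score import rouge_scorer


def score_openqa(predictions, references):
    """OpenQA: free-form answers scored with mean ROUGE-L F1."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    f1s = [scorer.score(ref, pred)["rougeL"].fmeasure
           for pred, ref in zip(predictions, references)]
    return sum(f1s) / len(f1s)


def score_closeqa(predictions, references):
    """CloseQA: multiple-choice answers scored with exact-match accuracy."""
    hits = sum(pred.strip().lower() == ref.strip().lower()
               for pred, ref in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    print(score_openqa(["a red mug on the counter"], ["the red mug"]))
    print(score_closeqa(["B", "C"], ["B", "D"]))  # 0.5
```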
Related papers
- Towards Fine-Grained Video Question Answering [17.582244704442747]
This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. We present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding.
arXiv Detail & Related papers (2025-03-10T01:02:01Z) - A Benchmark for Crime Surveillance Video Analysis with Large Models [22.683394427744616]
Anomaly analysis in surveillance videos is a crucial topic in computer vision. In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains. We propose a benchmark for crime surveillance video analysis with large models, denoted as UCVL.
arXiv Detail & Related papers (2025-02-13T13:38:17Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
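The semantic-aware redundancy reduction mentioned in this summary is not detailed here; the following is a minimal, hypothetical version that drops a frame when its embedding is too similar to the last kept frame. The embedding source and the 0.95 threshold are illustrative assumptions, not VideoEspresso's actual pipeline.

```python
# Hypothetical sketch of semantic redundancy reduction over video frames:
# keep a frame only if it is sufficiently dissimilar from the last kept frame.
import numpy as np


def deduplicate_frames(frame_embeddings: np.ndarray, threshold: float = 0.95):
    """Return indices of frames kept after dropping near-duplicates.

    frame_embeddings: (num_frames, dim) array of per-frame semantic embeddings
                      (e.g. from a CLIP-style image encoder).
    """
    # L2-normalize so the dot product equals cosine similarity.
    normed = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    kept = [0]  # always keep the first frame
    for i in range(1, len(normed)):
        if float(normed[i] @ normed[kept[-1]]) < threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(8, 512))
    embs[3] = embs[2] + 1e-3  # a near-duplicate frame that should be dropped
    print(deduplicate_frames(embs))
```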
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. We introduce the VQA2 Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment. The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z) - MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding.
We automatically generate 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long in Ego4D based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z) - LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models [53.64461404882853]
Video quality assessment (VQA) algorithms are needed to monitor and optimize the quality of streaming videos.
Here, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel visual modeling strategy for quality-aware feature extraction.
arXiv Detail & Related papers (2024-08-26T04:29:52Z) - TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models [32.6243916760583]
We present a framework for measuring two core capabilities of video comprehension: appearance and motion understanding.
We introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos.
Our model shows a 4.6%p improvement compared to V-JEPA (ViT-L) and a 7.7%p improvement compared to UMT (ViT-L).
arXiv Detail & Related papers (2024-08-21T03:56:27Z) - AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering [6.088350050879401]
We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering.
The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video-questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation.
By using instance-specific rules as prompts, GPT-4, as an automatic evaluator, can achieve a stable evaluation accuracy of around 97.0%, comparable to the 94.9% - 97.5% accuracy of a human evaluator.
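The LLM-as-evaluator setup described above could look roughly like the sketch below, assuming the standard OpenAI chat API; the rule text and prompt wording are made up for illustration and are not AutoEval-Video's released prompts.

```python
# Hypothetical sketch of an LLM-as-judge call that grades a model answer
# against an instance-specific rule supplied in the prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def judge_answer(question: str, rule: str, model_answer: str) -> str:
    """Ask GPT-4 to grade a candidate answer using a per-question rule."""
    prompt = (
        f"Question: {question}\n"
        f"Grading rule for this question: {rule}\n"
        f"Candidate answer: {model_answer}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```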
arXiv Detail & Related papers (2023-11-25T02:46:12Z) - ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos [120.80589215132322]
We present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over challenging untrimmed videos from ActivityNet.
ANetQA contains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos.
The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.
arXiv Detail & Related papers (2023-05-04T03:04:59Z) - Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [89.71617065426146]
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training.
Recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs.
We build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA.
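As a text-only toy illustration of the general idea (a frozen masked language model fills in the answer slot; the actual method additionally conditions on video features through lightweight trained adapters), one could write something like the following. The model choice and prompt template are assumptions for illustration only.

```python
# Toy, text-only illustration of answering a question with a frozen
# bidirectional (masked) language model by filling a <mask> slot.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="distilroberta-base")

prompt = ("Question: What is the person chopping? "
          "Answer: The person is chopping <mask>.")
for candidate in unmasker(prompt, top_k=3):
    # Each candidate is a dict with the predicted token and its probability.
    print(candidate["token_str"], round(candidate["score"], 3))
```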
arXiv Detail & Related papers (2022-06-16T13:18:20Z)