EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
- URL: http://arxiv.org/abs/2502.07411v1
- Date: Tue, 11 Feb 2025 09:45:06 GMT
- Title: EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
- Authors: Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, Angela Yao
- Abstract summary: We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text.
EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities.
- Abstract: We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance.
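The abstract notes that multi-frame reasoning and auxiliary scene-text inputs are key for better performance. Below is a minimal, hypothetical sketch of how such an evaluation could be wired up: sample frames from an ego-view video, pool OCR text across frames as an auxiliary input, query a multimodal LLM, and score exact-match accuracy. The names `sample_frames`, `run_ocr`, `query_mllm`, and the scoring rule are illustrative placeholders, not EgoTextVQA's released tooling or official metric.

```python
# Hedged sketch (not the paper's evaluation code): feeding auxiliary scene-text
# and multiple frames to a multimodal LLM for egocentric scene-text QA, then
# scoring accuracy. All helper functions are hypothetical placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class QAItem:
    video_path: str
    question: str
    answer: str  # ground-truth answer string


def sample_frames(video_path: str, num_frames: int = 16) -> List[str]:
    """Placeholder: uniformly sample frame references from the video."""
    return [f"{video_path}#frame{i}" for i in range(num_frames)]


def run_ocr(frame: str) -> List[str]:
    """Placeholder: detect scene text in one frame (e.g., via an OCR engine)."""
    return []


def query_mllm(frames: List[str], scene_text: List[str], question: str) -> str:
    """Placeholder: call a multimodal LLM with frames, OCR text, and the question."""
    return ""


def evaluate(items: List[QAItem]) -> float:
    correct = 0
    for item in items:
        frames = sample_frames(item.video_path)
        # Auxiliary scene-text input: OCR results pooled across sampled frames.
        scene_text = [text for frame in frames for text in run_ocr(frame)]
        prediction = query_mllm(frames, scene_text, item.question)
        # Simple exact-match scoring; the benchmark may use a softer metric.
        correct += int(prediction.strip().lower() == item.answer.strip().lower())
    return correct / max(len(items), 1)
```

In practice, a softer scoring scheme (e.g., LLM-judged or token-overlap matching) would likely be needed for open-ended answers; exact match is used here only to keep the sketch self-contained.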
Related papers
- ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos [71.62145804686062]
We introduce the first Egocentric Spatial Video Quality Assessment Database (ESVQAD), which comprises 600 egocentric spatial videos and their mean opinion scores (MOSs).
We propose a novel multi-dimensional binocular feature fusion model, termed ESVQAnet, which integrates binocular spatial, motion, and semantic features to predict the perceptual quality.
Experimental results demonstrate that ESVQAnet outperforms 16 state-of-the-art VQA models on the embodied perceptual quality assessment task.
arXiv Detail & Related papers (2024-12-29T10:13:30Z) - AMEGO: Active Memory from long EGOcentric videos [26.04157621755452]
We introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos.
Inspired by humans' ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from one egocentric video.
This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content.
arXiv Detail & Related papers (2024-09-17T06:18:47Z) - EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation [54.32133648259802]
We present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo.
This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions.
arXiv Detail & Related papers (2024-06-26T05:01:37Z) - Collaboration with Conversational AI Assistants for UX Evaluation:
Questions and How to Ask them (Voice vs. Text) [18.884080068561843]
We conducted a Wizard-of-Oz design probe study with 20 participants who interacted with simulated AI assistants via text or voice.
We found that participants asked for five categories of information: user actions, user mental model, help from the AI assistant, product and task information, and user demographics.
The text assistant was perceived as significantly more efficient, but both were rated equally in satisfaction and trust.
arXiv Detail & Related papers (2023-03-07T03:59:14Z) - EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark covers crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gaps between them and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z) - Exploring Anchor-based Detection for Ego4D Natural Language Query [74.87656676444163]
This paper presents the technical report for the Ego4D natural language query challenge at CVPR 2022.
We present our solution to this challenge.
arXiv Detail & Related papers (2022-08-10T14:43:37Z) - Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representations to advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z) - Data augmentation techniques for the Video Question Answering task [16.548016892117083]
We focus on the Egocentric VideoQA task, which exploits first-person videos.
Given the small size of the available dataset, models tend to overfit quickly.
We propose several augmentation techniques which give us a +5.5% improvement on the final accuracy over the considered baseline.
arXiv Detail & Related papers (2020-08-22T14:34:55Z)
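The last entry above reports that augmentation improves VideoQA accuracy but does not detail the techniques in this summary. Below is a toy, hypothetical sketch of two generic VideoQA augmentations (temporal jitter of the sampled frame window and answer-preserving question rewording); it is illustrative only and not claimed to be the paper's actual method.

```python
# Hedged sketch: generic VideoQA augmentations, not the techniques from the
# cited paper. Frames are represented as a list of frame references.

import random
from typing import List, Tuple


def temporal_jitter(frames: List[str], keep_ratio: float = 0.8) -> List[str]:
    """Randomly crop a contiguous window keeping roughly keep_ratio of the frames."""
    if not frames:
        return frames
    keep = max(1, int(len(frames) * keep_ratio))
    start = random.randint(0, len(frames) - keep)
    return frames[start:start + keep]


def reword_question(question: str) -> str:
    """Toy answer-preserving rewording via simple template swaps (hypothetical)."""
    swaps = [("What is", "What's"), ("Where is", "Where's")]
    for old, new in swaps:
        if question.startswith(old):
            return question.replace(old, new, 1)
    return question


def augment(sample: Tuple[List[str], str, str]) -> Tuple[List[str], str, str]:
    """Apply both augmentations to a (frames, question, answer) training sample."""
    frames, question, answer = sample
    return temporal_jitter(frames), reword_question(question), answer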