EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
- URL: http://arxiv.org/abs/2502.07411v2
- Date: Fri, 21 Mar 2025 14:21:30 GMT
- Title: EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
- Authors: Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, Angela Yao
- Abstract summary: We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real user needs in outdoor driving and indoor house-keeping activities.
- Score: 95.2396264550978
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33\% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance. Our dataset is released at: https://github.com/zhousheng97/EgoTextVQA.
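As a rough illustration of the evaluation setting described in the abstract, the sketch below loads QA annotations, optionally prepends auxiliary scene-text (e.g., OCR output, which the paper reports as helpful), and scores a model's answers by exact match. The field names (video_id, question, answer, ocr_text) and the exact-match metric are assumptions for illustration only, not the released EgoTextVQA schema or the paper's official scoring protocol.

```python
import json


def load_qa_items(path):
    """Load EgoTextVQA-style QA annotations from a JSON file.

    The field names used below are illustrative assumptions,
    not the released dataset schema.
    """
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def build_prompt(item):
    # Optionally prepend auxiliary scene-text (e.g., OCR output) to the
    # question, since the paper reports this as a key input for performance.
    ocr = item.get("ocr_text", "")
    prefix = f"Scene text: {ocr}\n" if ocr else ""
    return prefix + f"Question: {item['question']}"


def evaluate(items, answer_fn):
    """Compute exact-match accuracy of a model's free-form answers.

    answer_fn: callable taking (video_id, prompt) and returning a string.
    The benchmark may use a different scoring protocol; exact match is
    only a stand-in here.
    """
    correct = 0
    for item in items:
        pred = answer_fn(item["video_id"], build_prompt(item))
        correct += int(pred.strip().lower() == item["answer"].strip().lower())
    return correct / max(len(items), 1)
```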
Related papers
- ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos [71.62145804686062]
We introduce the first Egocentric Spatial Video Quality Assessment Database (ESVQAD), which comprises 600 egocentric spatial videos and their mean opinion scores (MOSs). We propose a novel multi-dimensional binocular feature fusion model, termed ESVQAnet, which integrates binocular spatial, motion, and semantic features to predict perceptual quality. Experimental results demonstrate that ESVQAnet outperforms 16 state-of-the-art VQA models on the embodied perceptual quality assessment task.
arXiv Detail & Related papers (2024-12-29T10:13:30Z) - MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding.
We automatically generate 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long in Ego4D based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z) - Scene-Text Grounding for Text-Based Video Question Answering [97.1112579979614]
Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and reliance on scene-text recognition.
We study Grounded TextVideoQA by forcing models to answer questions and interpret relevant scene-text regions.
arXiv Detail & Related papers (2024-09-22T05:13:11Z) - AMEGO: Active Memory from long EGOcentric videos [26.04157621755452]
We introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos.
Inspired by humans' ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from one egocentric video.
This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content.
arXiv Detail & Related papers (2024-09-17T06:18:47Z) - EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation [54.32133648259802]
We present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo.
This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions.
arXiv Detail & Related papers (2024-06-26T05:01:37Z) - EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z) - EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show their significant gaps from humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z) - Exploring Anchor-based Detection for Ego4D Natural Language Query [74.87656676444163]
This paper presents the technical report for the Ego4D natural language query challenge at CVPR 2022.
We propose our solution to this challenge.
arXiv Detail & Related papers (2022-08-10T14:43:37Z) - TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z) - Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representations that advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z) - Data augmentation techniques for the Video Question Answering task [16.548016892117083]
We focus on the Egocentric VideoQA task, which exploits first-person videos.
Given the small size of the available data, models tend to overfit quickly.
We propose several augmentation techniques which give us a +5.5% improvement on the final accuracy over the considered baseline.
arXiv Detail & Related papers (2020-08-22T14:34:55Z)