ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
- URL: http://arxiv.org/abs/2209.08199v4
- Date: Sun, 09 Feb 2025 21:09:17 GMT
- Title: ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
- Authors: Yu-Chung Hsiao, Fedir Zubach, Gilles Baechler, Srinivas Sunkara, Victor Carbune, Jason Lin, Maria Wang, Yun Zhu, Jindong Chen
- Abstract summary: ScreenQA is a novel benchmarking dataset designed to advance screen content understanding through question answering.
By annotating 86k question-answer pairs over the RICO dataset, we aim to benchmark the screen reading comprehension capacity.
We evaluate the dataset's efficacy using both open-weight and proprietary models in zero-shot, fine-tuned, and transfer learning settings.
- Abstract: We introduce ScreenQA, a novel benchmarking dataset designed to advance screen content understanding through question answering. The existing screen datasets are focused either on low-level structural and component understanding, or on a much higher-level composite task such as navigation and task completion for autonomous agents. ScreenQA attempts to bridge this gap. By annotating 86k question-answer pairs over the RICO dataset, we aim to benchmark the screen reading comprehension capacity, thereby laying the foundation for vision-based automation over screenshots. Our annotations encompass full answers, short answer phrases, and corresponding UI contents with bounding boxes, enabling four subtasks to address various application scenarios. We evaluate the dataset's efficacy using both open-weight and proprietary models in zero-shot, fine-tuned, and transfer learning settings. We further demonstrate positive transfer to web applications, highlighting its potential beyond mobile applications.
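To make the annotation structure described in the abstract concrete, the sketch below shows one way a single ScreenQA example could be represented: a question over a RICO screenshot, a full answer, a short answer phrase, and the supporting UI contents with bounding boxes. This is an illustrative assumption, not the released dataset's actual schema; the field names, the normalized coordinate convention, and the `ScreenQAExample` / `UIElement` classes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UIElement:
    # UI content that supports the answer, as described in the abstract.
    text: str
    # Hypothetical convention: [left, top, right, bottom], normalized to [0, 1].
    bounding_box: List[float]

@dataclass
class ScreenQAExample:
    screenshot_id: str        # RICO screenshot the question is asked about
    question: str
    full_answer: str          # full-sentence answer
    short_answer: str         # short answer phrase
    ui_elements: List[UIElement] = field(default_factory=list)

# Hypothetical example record.
example = ScreenQAExample(
    screenshot_id="rico/12345",
    question="What is the departure time of the first flight?",
    full_answer="The first flight departs at 7:45 AM.",
    short_answer="7:45 AM",
    ui_elements=[UIElement(text="7:45 AM",
                           bounding_box=[0.12, 0.30, 0.35, 0.34])],
)
```

A record of this shape would cover the four subtasks mentioned in the abstract, since each subtask can be posed by hiding or requiring different fields (full answer, short answer, or grounded UI elements).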
Related papers
- Trust but Verify: Programmatic VLM Evaluation in the Wild [62.14071929143684]
Programmatic VLM Evaluation (PROVE) is a new benchmarking paradigm for evaluating VLM responses to open-ended queries.
We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two.
arXiv Detail & Related papers (2024-10-17T01:19:18Z) - WebQuest: A Benchmark for Multimodal QA on Web Page Sequences [10.008284460456107]
WebQuest is a multi-page question-answering dataset that requires reasoning across multiple web pages.
Our dataset evaluates information extraction, multimodal retrieval and composition of information from many web pages.
We evaluate leading proprietary multimodal models like GPT-4V, Gemini Flash, Claude 3, and open source models like InstructBLIP, PaliGemma on our dataset.
arXiv Detail & Related papers (2024-09-06T18:44:25Z) - OmniParser for Pure Vision Based GUI Agent [37.911094082816504]
The power of multimodal models like GPT-4V to act as general agents across multiple operating systems is largely underestimated due to the lack of a robust screen parsing technique.
OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark.
With screenshots alone, OmniParser outperforms GPT-4V baselines that require additional information beyond the screenshot.
arXiv Detail & Related papers (2024-08-01T00:00:43Z) - AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents [50.39555842254652]
We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios.
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels.
AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
arXiv Detail & Related papers (2024-07-03T17:59:58Z) - LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Task Automation [8.998467488526327]
This paper presents LlamaTouch, a testbed for on-device mobile UI task execution and faithful, scalable task evaluation.
LlamaTouch employs a novel evaluation approach that only assesses whether an agent traverses all manually annotated, essential application/system states.
LlamaTouch also enables easy task annotation and integration of new mobile agents.
arXiv Detail & Related papers (2024-04-12T15:39:09Z) - SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM [48.15067480282839]
This work introduces a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric VQA.
The dataset is organized into 22 major categories, containing 7,568 unique entities in total.
Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BLEURT score.
arXiv Detail & Related papers (2024-03-07T18:38:17Z) - TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document [60.01330653769726]
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.
By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions.
By expanding its capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability.
arXiv Detail & Related papers (2024-03-07T13:16:24Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (a minimal sketch of this idea appears after this list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z) - Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning [34.24671403624908]
Mobile User Interface Summarization generates succinct language descriptions of mobile screens for conveying important contents and functionalities of the screen.
We present Screen2Words, a novel screen summarization approach that automatically encapsulates essential information of a UI screen into a coherent language phrase.
arXiv Detail & Related papers (2021-08-07T03:01:23Z)
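The CLIP-score-guided frame sampling mentioned in the VaQuitA entry above can be illustrated with a minimal sketch. This is not the authors' implementation; it simply ranks candidate frames by CLIP image-text similarity to the query and keeps the top-k in temporal order, using the Hugging Face `transformers` CLIP API. The model checkpoint, the value of `k`, and the helper name `sample_frames_by_clip_score` are illustrative assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_frames_by_clip_score(frames, query, k=8):
    """Rank frames (PIL images) by CLIP image-text similarity to `query`
    and return the top-k frames in their original temporal order."""
    inputs = processor(text=[query], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_frames, 1): similarity of each frame to the query.
    scores = out.logits_per_image.squeeze(-1)
    top_idx = torch.topk(scores, k=min(k, len(frames))).indices.sort().values
    return [frames[i] for i in top_idx.tolist()]
```

In VaQuitA the scores guide which frames feed the downstream video-language components; in this sketch the top-k selection simply replaces uniform sampling.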
This list is automatically generated from the titles and abstracts of the papers on this site.