EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
- URL: http://arxiv.org/abs/2311.15596v2
- Date: Thu, 28 Mar 2024 11:35:55 GMT
- Title: EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
- Authors: Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, Yang Liu
- Abstract summary: Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks.
EgoThink is a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
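The abstract describes evaluating open-ended answers by using GPT-4 as an automatic judge for single-answer grading. A minimal sketch of that protocol is below; the prompt wording, the 0-to-1 scoring scale, and the `judge` callable are illustrative assumptions, not the paper's exact rubric or implementation.

```python
import re

def build_judge_prompt(question, reference, prediction):
    # Hypothetical prompt template; the paper's exact grading rubric is not reproduced here.
    return (
        "You are grading an open-ended answer to a first-person perspective question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Rate the model answer from 0 (wrong) to 1 (fully correct) "
        "and reply with a line of the form 'Score: <number>'."
    )

def parse_score(judge_reply):
    # Extract the numeric grade from the judge's free-text reply; None if absent.
    m = re.search(r"Score:\s*([01](?:\.\d+)?)", judge_reply)
    return float(m.group(1)) if m else None

def grade(samples, judge):
    # `judge` is any callable that sends a prompt to an LLM judge (e.g. GPT-4)
    # and returns its text reply; averaging the per-sample grades gives the
    # benchmark score for one model.
    scores = [parse_score(judge(build_judge_prompt(*s))) for s in samples]
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else 0.0
```

In practice the `judge` callable would wrap an API call to the judge model; keeping it as a parameter makes the grading loop testable without network access.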
Related papers
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs [83.24033574914425]
We present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving.
Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information.
Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks.
arXiv Detail & Related papers (2024-06-20T17:54:03Z)
- AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding [44.79843213164787]
Embodied AI personal assistants require embodied understanding to collaborate with humans effectively.
Current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric experience.
We introduce the Egocentric Video Understanding dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos.
We present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD.
arXiv Detail & Related papers (2024-06-19T20:14:14Z)
- TopViewRS: Vision-Language Models as Top-View Spatial Reasoners [38.406430696146714]
Top-view perspective denotes a typical way in which humans read and reason over different types of maps.
We introduce the TopViewRS dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view map as visual input.
We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity.
arXiv Detail & Related papers (2024-06-04T17:55:43Z)
- Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge [4.981275578987307]
Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts.
However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models.
This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied.
arXiv Detail & Related papers (2024-05-08T17:57:39Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- Large Language Models as Automated Aligners for benchmarking Vision-Language Models [48.4367174400306]
Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks.
Existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence.
In this work, we address these limitations via Auto-Bench, which explores LLMs as proficient curators, measuring the alignment between VLMs and human intelligence and values through automatic data curation and assessment.
arXiv Detail & Related papers (2023-11-24T16:12:05Z)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
How to evaluate large vision-language models remains a major obstacle, hindering future model development.
Traditional benchmarks provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics.
Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias.
MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.