Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks
- URL: http://arxiv.org/abs/2405.18831v1
- Date: Wed, 29 May 2024 07:20:28 GMT
- Title: Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks
- Authors: Simranjit Singh, Georgios Pavlakos, Dimitrios Stamoulis,
- Abstract summary: We evaluate the zero-shot performance of foundational models on 3D VQA benchmarks.
We find that GPT-based agents without any fine-tuning perform on par with the closed vocabulary approaches.
- Score: 13.899853299593012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As interest in "reformulating" the 3D Visual Question Answering (VQA) problem in the context of foundation models grows, it is imperative to assess how these new paradigms influence existing closed-vocabulary datasets. In this case study, we evaluate the zero-shot performance of foundational models (GPT-4 Vision and GPT-4) on well-established 3D VQA benchmarks, namely 3D-VQA and ScanQA. We provide an investigation to contextualize the performance of GPT-based agents relative to traditional modeling approaches. We find that GPT-based agents without any fine-tuning perform on par with the closed vocabulary approaches. Our findings corroborate recent results that "blind" models establish a surprisingly strong baseline in closed-vocabulary settings. We demonstrate that agents benefit significantly from scene-specific vocabulary via in-context textual grounding. By presenting a preliminary comparison with previous baselines, we hope to inform the community's ongoing efforts to refine multi-modal 3D benchmarks.
Related papers
- Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning [0.42855555838080844]
This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought prompting and reinforcement learning.<n>We find that simple CoT formats, where the model generates a reasoning step before the answer, can harm the model's original performance.<n>In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy.
arXiv Detail & Related papers (2025-07-06T10:51:12Z) - Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning [6.537257913467247]
This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD)
Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning.
Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios.
arXiv Detail & Related papers (2025-01-15T13:46:33Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities.
Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective [44.045767657945895]
We examine the brittleness of the image-text retrieval (ITR) evaluation pipeline with a focus on concept granularity.
We evaluate four diverse state-of-the-art Vision-Language models on both the standard and fine-grained datasets under zero-shot conditions.
The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts.
arXiv Detail & Related papers (2024-07-21T18:08:44Z) - Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving [55.93813178692077]
We present RoboBEV, an extensive benchmark suite designed to evaluate the resilience of BEV algorithms.
We assess 33 state-of-the-art BEV-based perception models spanning tasks like detection, map segmentation, depth estimation, and occupancy prediction.
Our experimental results also underline the efficacy of strategies like pre-training and depth-free BEV transformations in enhancing robustness against out-of-distribution data.
arXiv Detail & Related papers (2024-05-27T17:59:39Z) - Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z) - CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples [34.71588837946776]
We propose CounterCurate, a framework to improve visio-linguistic compositional reasoning.
In particular, we identify two critical under-explored problems: the neglect of the physically grounded reasoning.
We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning.
We then apply simple data augmentation using grounded image generation model GLIGEN to generate fine-tuning data, resulting in significant performance improvements.
arXiv Detail & Related papers (2024-02-20T18:59:55Z) - RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering
Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z) - Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs)
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities.
arXiv Detail & Related papers (2023-12-17T09:44:27Z) - News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z) - Elaboration-Generating Commonsense Question Answering at Scale [77.96137534751445]
In question answering requiring common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge.
We finetune smaller language models to generate useful intermediate context, referred to here as elaborations.
Our framework alternates between updating two language models -- an elaboration generator and an answer predictor -- allowing each to influence the other.
arXiv Detail & Related papers (2022-09-02T18:32:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.