Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery
- URL: http://arxiv.org/abs/2408.12821v1
- Date: Fri, 23 Aug 2024 03:45:31 GMT
- Title: Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery
- Authors: Zhenyuan Yang, Xuhui Lin, Qinyi He, Ziye Huang, Zhengliang Liu, Hanqi Jiang, Peng Shu, Zihao Wu, Yiwei Li, Stephen Law, Gengchen Mai, Tianming Liu, Tao Yang
- Abstract summary: This paper investigates the capabilities of ChatGPT-4V and Gemini Pro for Street View Imagery, Built Environment, and Interior.
The results reveal proficiency in length measurement, style analysis, question answering, and basic image understanding.
Overall, the findings demonstrate foundational multimodal intelligence, emphasizing the potential of FMs to drive forward interdisciplinary applications.
- Score: 25.420749355424377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of Large Language Models (LLMs) and multimodal foundation models (FMs) has generated heightened interest in their applications that integrate vision and language. This paper investigates the capabilities of ChatGPT-4V and Gemini Pro for Street View Imagery, Built Environment, and Interior by evaluating their performance across various tasks. The assessments include street furniture identification, pedestrian and car counts, and road width measurement in Street View Imagery; building function classification, building age analysis, building height analysis, and building structure classification in the Built Environment; and interior room classification, interior design style analysis, interior furniture counts, and interior length measurement in Interior. The results reveal proficiency in length measurement, style analysis, question answering, and basic image understanding, but highlight limitations in detailed recognition and counting tasks. While zero-shot learning shows potential, performance varies depending on the problem domains and image complexities. This study provides new insights into the strengths and weaknesses of multimodal foundation models for practical challenges in Street View Imagery, Built Environment, and Interior. Overall, the findings demonstrate foundational multimodal intelligence, emphasizing the potential of FMs to drive forward interdisciplinary applications at the intersection of computer vision and language.
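To make the zero-shot setting evaluated in this paper concrete, here is a minimal sketch of a single counting query sent to a multimodal model through the OpenAI Python SDK. It is not the authors' evaluation harness; the model name, prompt wording, and image path are illustrative assumptions, and the same pattern extends to the other tasks (road width estimation, building function classification, and so on) by changing the text prompt.

```python
# Minimal sketch of a zero-shot counting query against a multimodal model.
# Assumptions: the OpenAI Python SDK (v1.x) is installed, OPENAI_API_KEY is set,
# and "street_view.jpg" is a local street view image; this does not reproduce
# the paper's exact prompts or evaluation protocol.
import base64
from openai import OpenAI

client = OpenAI()

def count_pedestrians(image_path: str, model: str = "gpt-4o") -> str:
    """Ask the model to count pedestrians in a street view image (zero-shot)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "How many pedestrians are visible in this street view image? "
                         "Answer with a single integer."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(count_pedestrians("street_view.jpg"))
```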
Related papers
- VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs [12.64051404166593]
This paper introduces VisualQuest, a novel image dataset designed to assess the ability of large language models to interpret non-traditional, stylized imagery.
Unlike conventional photographic benchmarks, VisualQuest challenges models with images that incorporate abstract, symbolic, and metaphorical elements.
arXiv Detail & Related papers (2025-03-25T01:23:11Z)
- ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning [62.61187785810336]
ImageScope is a training-free, three-stage framework that unifies language-guided image retrieval tasks.
In the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity.
In the second and third stages, we reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally.
arXiv Detail & Related papers (2025-03-13T08:43:24Z)
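To make the three-stage structure of the ImageScope entry above concrete, here is a rough, training-free skeleton. The helper callables standing in for multimodal model calls (synthesize_intents, verify_predicates, pairwise_compare) are hypothetical placeholders and not part of the ImageScope codebase.

```python
# Rough skeleton of a three-stage, training-free retrieval pipeline in the spirit
# of ImageScope; all model-facing helpers are hypothetical stand-ins.
from typing import Callable, List

def imagescope_style_retrieve(
    query: str,
    candidates: List[str],                             # candidate image identifiers
    synthesize_intents: Callable[[str], List[str]],    # stage 1: intents at several granularities
    verify_predicates: Callable[[str, str], bool],     # stage 2: check one intent against one image
    pairwise_compare: Callable[[str, str, str], str],  # stage 3: pick the better of two images
) -> str:
    # Stage 1: expand the query into search intents of varying semantic granularity.
    intents = synthesize_intents(query)

    # Stage 2: local verification -- keep images that satisfy every synthesized intent.
    shortlist = [img for img in candidates
                 if all(verify_predicates(intent, img) for intent in intents)]
    if not shortlist:
        shortlist = candidates  # fall back if verification filters everything out

    # Stage 3: global pairwise evaluation -- tournament-style comparison of survivors.
    best = shortlist[0]
    for img in shortlist[1:]:
        best = pairwise_compare(query, best, img)
    return best
```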
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- GalleryGPT: Analyzing Paintings with Large Multimodal Models [64.98398357569765]
Artwork analysis is an important and fundamental skill for art appreciation; it can enrich personal aesthetic sensibility and foster critical thinking.
Previous work on automatically analyzing artworks has mainly focused on classification, retrieval, and other simple tasks, which falls far short of the goal of AI.
We introduce GalleryGPT, a large multimodal model for composing painting analyses, which is slightly modified and fine-tuned based on the LLaVA architecture.
arXiv Detail & Related papers (2024-08-01T11:52:56Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- ELEV-VISION-SAM: Integrated Vision Language and Foundation Model for Automated Estimation of Building Lowest Floor Elevation [1.2070884166650049]
This study integrates the Segment Anything Model, a segmentation foundation model, with vision language models to perform text-prompted image segmentation on street view images for lowest floor elevation (LFE) estimation.
Our proposed method significantly extends the availability of LFE estimation to almost all properties whose front door is visible in the street view image.
arXiv Detail & Related papers (2024-04-19T03:16:08Z)
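As a rough illustration of the text-prompted segmentation step described in the ELEV-VISION-SAM entry above, the sketch below prompts the Segment Anything Model with a bounding box for the front door. `detect_front_door_box` is a hypothetical stand-in for the vision language model component, and the file and checkpoint names are placeholders; this is not the authors' pipeline, only a sketch of the combination the entry describes.

```python
# Minimal sketch of text-prompted segmentation on a street view image by combining
# a vision language model (hypothetical) with the Segment Anything Model.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

def detect_front_door_box(image: np.ndarray) -> np.ndarray:
    """Hypothetical VLM/detector call returning the front door box as [x0, y0, x1, y1]."""
    raise NotImplementedError  # replace with an open-vocabulary detector of your choice

image = cv2.cvtColor(cv2.imread("street_view.jpg"), cv2.COLOR_BGR2RGB)
box = detect_front_door_box(image)

# Placeholder checkpoint path; use an official SAM checkpoint in practice.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# Prompt SAM with the detected box; the mask's vertical extent can feed a
# downstream lowest-floor-elevation estimate.
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
door_mask = masks[0]
```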
- The Revolution of Multimodal Large Language Models: A Survey [46.84953515670248]
Multimodal Large Language Models (MLLMs) can seamlessly integrate visual and textual modalities.
This paper provides a review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques.
arXiv Detail & Related papers (2024-02-19T19:01:01Z)
- On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications [38.416917485939486]
This paper explores the capabilities of GPT-4V in the realms of geography, environmental science, agriculture, and urban planning.
Data sources include satellite imagery, aerial photos, ground-level images, field images, and public datasets.
The model is evaluated on a series of tasks including geo-localization, textual data extraction from maps, remote sensing image classification, visual question answering, crop type identification, disease/pest/weed recognition, chicken behavior analysis, agricultural object counting, urban planning knowledge question answering, and plan generation.
arXiv Detail & Related papers (2023-12-23T22:36:58Z)
- Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond [7.760124498553333]
We study whether vision-language models execute vision and language tasks consistently or independently.
We introduce a systematic framework that quantifies the capability disparities between different modalities in the multi-modal setting.
We introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
arXiv Detail & Related papers (2023-10-19T06:45:11Z)
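The "Vision Description Prompting" idea in the entry above can be sketched as a two-step prompt chain: describe the image first, then answer the question from that description. The wrapper functions and prompt wording below are illustrative assumptions, not the paper's implementation.

```python
# Rough two-step sketch of a describe-then-answer prompting strategy.
# `ask_vision_model` and `ask_text_model` are hypothetical wrappers around
# whichever multimodal and text model APIs you use.
from typing import Callable

def vision_description_prompting(
    image_path: str,
    question: str,
    ask_vision_model: Callable[[str, str], str],  # (image_path, prompt) -> answer
    ask_text_model: Callable[[str], str],         # (prompt) -> answer
) -> str:
    # Step 1: elicit a detailed textual description of the visual content.
    description = ask_vision_model(
        image_path,
        "Describe this image in detail, including objects, counts, and spatial layout.",
    )
    # Step 2: answer the question from the description, shifting the burden
    # toward language-side reasoning.
    return ask_text_model(
        f"Image description: {description}\n\nQuestion: {question}\nAnswer concisely."
    )
```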
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or steering a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Automatic Image Content Extraction: Operationalizing Machine Learning in Humanistic Photographic Studies of Large Visual Archives [81.88384269259706]
We introduce the Automatic Image Content Extraction framework for machine learning-based search and analysis of large image archives.
The proposed framework can be applied in several domains in humanities and social sciences.
arXiv Detail & Related papers (2022-04-05T12:19:24Z)
- Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types [50.1843146606122]
A simple form of transfer learning is common in current state-of-the-art computer vision models.
Previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood.
In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains.
arXiv Detail & Related papers (2021-03-24T16:24:20Z)
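To make the "simple form of transfer learning" mentioned in the last entry concrete, here is a minimal sketch that fine-tunes a new classification head on an ImageNet-pretrained torchvision backbone; the dataset path and class count are placeholder assumptions.

```python
# Minimal sketch of ImageNet pre-training followed by fine-tuning on a new task.
# "data/train" and NUM_CLASSES are placeholders for the target dataset.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 10  # placeholder for the target task

# ImageNet-pretrained ResNet-50 backbone with frozen features and a new head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("data/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one pass for illustration
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```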