ArchGPT: Understanding the World's Architectures with Large Multimodal Models
- URL: http://arxiv.org/abs/2509.20858v1
- Date: Thu, 25 Sep 2025 07:49:43 GMT
- Title: ArchGPT: Understanding the World's Architectures with Large Multimodal Models
- Authors: Yuze Wang, Luo Yang, Junyi Wang, Yue Qi
- Abstract summary: We present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data-construction pipeline. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets.
- Score: 6.504675786709239
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Architecture embodies aesthetic, cultural, and historical values, standing as a tangible testament to human civilization. Researchers have long leveraged virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable immersive exploration and interpretation of architecture, enhancing accessibility, public understanding, and creative workflows around architecture in education, heritage preservation, and professional design practice. However, existing VR/MR/AR systems are often developed case-by-case, relying on hard-coded annotations and task-specific interactions that do not scale across diverse built environments. In this work, we present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data-construction pipeline for curating high-quality, architecture-specific VQA annotations. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets. Arch-300K is built via a multi-stage process: first, we curate architectural scenes from Wikimedia Commons and filter unconstrained tourist photo collections using a novel coarse-to-fine strategy that integrates 3D reconstruction and semantic segmentation to select occlusion-free, structurally consistent architectural images. To mitigate noise and inconsistency in raw textual metadata, we propose an LLM-guided text verification and knowledge-distillation pipeline to generate reliable, architecture-specific question-answer pairs. Using these curated images and refined metadata, we further synthesize formal analysis annotations, including detailed descriptions and aspect-guided conversations, to provide richer semantic variety while remaining faithful to the data. We perform supervised fine-tuning of an open-source multimodal backbone, ShareGPT4V-7B, on Arch-300K, yielding ArchGPT.
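The coarse-to-fine filtering stage described in the abstract can be sketched as a two-pass threshold filter. This is a minimal illustration, not the paper's implementation: the statistics, thresholds, and names (`ImageStats`, `coarse_to_fine_filter`) are hypothetical stand-ins for scores that would actually come from 3D reconstruction (e.g. SfM reprojection error) and semantic segmentation (occlusion masks).

```python
from dataclasses import dataclass

@dataclass
class ImageStats:
    """Per-image statistics; in a real pipeline these would be computed
    from 3D reconstruction (SfM) and semantic segmentation."""
    name: str
    reproj_error_px: float   # mean SfM reprojection error (hypothetical proxy)
    occlusion_ratio: float   # fraction of building pixels occluded (hypothetical)

def coarse_to_fine_filter(images, max_reproj_px=2.0, max_occlusion=0.1):
    """Two passes: a coarse geometric-consistency check via reconstruction
    error, then a fine occlusion check via segmentation-derived masks."""
    coarse = [im for im in images if im.reproj_error_px <= max_reproj_px]
    return [im for im in coarse if im.occlusion_ratio <= max_occlusion]

photos = [
    ImageStats("facade_ok.jpg", 0.8, 0.02),   # passes both checks
    ImageStats("blurry.jpg", 5.1, 0.01),      # rejected: inconsistent geometry
    ImageStats("crowded.jpg", 1.0, 0.45),     # rejected: heavily occluded
]
kept = coarse_to_fine_filter(photos)
print([im.name for im in kept])  # ['facade_ok.jpg']
```

The coarse geometric pass is cheap to run over a whole tourist-photo collection, so only structurally consistent images reach the more expensive segmentation-based occlusion check.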
Related papers
- Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition [53.50448142467294]
RAIM is a multi-design and architecture-aware framework for repository-level feature addition. It shifts away from linear patching by generating multiple diverse implementation designs. Experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes new state-of-the-art performance.
arXiv Detail & Related papers (2026-03-02T12:50:40Z)
- A Sketch+Text Composed Image Retrieval Dataset for Thangka [14.600552992453977]
Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities. CIRThan is a sketch+text composed image retrieval dataset for Thangka imagery.
arXiv Detail & Related papers (2026-02-09T09:14:29Z)
- Video Understanding by Design: How Datasets Shape Architectures and Insights [47.846604113207206]
Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode.
arXiv Detail & Related papers (2025-09-11T05:06:30Z)
- Taking Language Embedded 3D Gaussian Splatting into the Wild [6.550474097747006]
We propose a novel framework for open-vocabulary scene understanding from unconstrained photo collections. Specifically, we first render multiple appearance images from the same viewpoint, then extract multi-appearance CLIP features. We then propose a transient uncertainty-aware autoencoder, a multi-appearance language field 3DGS representation, and a post-ensemble strategy to effectively compress, learn, and fuse language features.
arXiv Detail & Related papers (2025-07-26T07:00:32Z)
- Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [89.77871049500546]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
- OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery [4.33299613844962]
Building properties play a crucial role in spatial data infrastructures, supporting applications such as energy simulation, risk assessment, and environmental modeling. Recent advances have enabled the extraction and tagging of objective building attributes using remote sensing and street-level imagery. This study bridges the gaps by introducing OpenFACADES, an open framework that leverages crowdsourced data to enrich building profiles.
arXiv Detail & Related papers (2025-04-01T08:20:13Z)
- Multi-View Depth Consistent Image Generation Using Generative AI Models: Application on Architectural Design of University Buildings [20.569648863933285]
We propose a novel three-stage consistent image generation framework using generative AI models. We employ ControlNet as the backbone and optimize it to accommodate multi-view inputs of architectural shoebox models. Experimental results demonstrate that the proposed framework can generate multi-view architectural images with consistent style and structural coherence.
arXiv Detail & Related papers (2025-03-05T00:16:09Z)
- CULTURE3D: A Large-Scale and Diverse Dataset of Cultural Landmarks and Terrains for Gaussian-Based Scene Rendering [12.299096433876676]
Current state-of-the-art 3D reconstruction models face limitations in building extra-large scale outdoor scenes. In this paper, we present an extra-large fine-grained dataset with 10 billion points composed of 41,006 drone-captured high-resolution aerial images. Compared to existing datasets, ours offers significantly larger scale and higher detail, uniquely suited for fine-grained 3D applications.
arXiv Detail & Related papers (2025-01-12T20:36:39Z)
- Serving Deep Learning Model in Relational Databases [70.53282490832189]
Serving deep learning (DL) models on relational data has become a critical requirement across diverse commercial and scientific domains.
We highlight three pivotal paradigms: The state-of-the-art DL-centric architecture offloads DL computations to dedicated DL frameworks.
The potential UDF-centric architecture encapsulates one or more tensor computations into User Defined Functions (UDFs) within the relational database management system (RDBMS).
arXiv Detail & Related papers (2023-10-07T06:01:35Z)
- Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks [4.093474663507322]
Bridge-architectures project from the image space to the text space to solve tasks such as VQA, captioning, and image retrieval.
We extend the traditional bridge architectures for the NLVR2 dataset by adding object-level features to facilitate fine-grained object reasoning.
Our analysis shows that adding object level features to bridge architectures does not help, and that pre-training on multi-modal data is key for good performance on complex reasoning tasks such as NLVR2.
arXiv Detail & Related papers (2023-07-31T03:57:31Z)
- General-purpose, long-context autoregressive modeling with Perceiver AR [58.976153199352254]
We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to latents.
Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation.
Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books.
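The core idea, cross-attending from a small latent array to a very long input sequence, can be sketched in a few lines of NumPy. This is a simplified illustration under assumed shapes, omitting the learned projections, multiple heads, and causal masking that Perceiver AR actually uses; the point is only that compute scales with the latent count, not the input length.

```python
import numpy as np

def cross_attend(latents, inputs):
    """Map a long input sequence onto a small latent array via
    cross-attention: queries come from the latents, keys/values
    from the inputs (scaled dot-product, softmax over inputs)."""
    scores = latents @ inputs.T / np.sqrt(latents.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ inputs

rng = np.random.default_rng(0)
inputs = rng.normal(size=(10_000, 64))  # long input (Perceiver AR scales this to >100k tokens)
latents = rng.normal(size=(256, 64))    # small latent array
out = cross_attend(latents, inputs)
print(out.shape)  # (256, 64)
```

The output has one row per latent regardless of input length, which is why subsequent self-attention layers over the latents stay cheap even for very long contexts.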
arXiv Detail & Related papers (2022-02-15T22:31:42Z)
- Multi-Stage Progressive Image Restoration [167.6852235432918]
We propose a novel synergistic design that can optimally balance these competing goals.
Our main proposal is a multi-stage architecture that progressively learns restoration functions for the degraded inputs.
The resulting tightly interlinked multi-stage architecture, named MPRNet, delivers strong performance gains on ten datasets.
arXiv Detail & Related papers (2021-02-04T18:57:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.