CoMa: Contextual Massing Generation with Vision-Language Models
- URL: http://arxiv.org/abs/2601.08464v1
- Date: Tue, 13 Jan 2026 11:44:00 GMT
- Title: CoMa: Contextual Massing Generation with Vision-Language Models
- Authors: Evgenii Maslov, Valentin Khrulkov, Anastasia Volkova, Anton Gusarov, Andrey Kuznetsov, Ivan Oseledets
- Abstract summary: We propose an automated framework for generating building massing based on functional requirements and site context. A primary obstacle to such data-driven methods has been the lack of suitable datasets, so we introduce the CoMa-20K dataset and benchmark it by formulating massing generation as a conditional task for Vision-Language Models.
- Score: 7.943264761730892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The conceptual design phase in architecture and urban planning, particularly building massing, is complex and heavily reliant on designer intuition and manual effort. To address this, we propose an automated framework for generating building massing based on functional requirements and site context. A primary obstacle to such data-driven methods has been the lack of suitable datasets. Consequently, we introduce the CoMa-20K dataset, a comprehensive collection that includes detailed massing geometries, associated economic and programmatic data, and visual representations of the development site within its existing urban context. We benchmark this dataset by formulating massing generation as a conditional task for Vision-Language Models (VLMs), evaluating both fine-tuned and large zero-shot models. Our experiments reveal the inherent complexity of the task while demonstrating the potential of VLMs to produce context-sensitive massing options. The dataset and analysis establish a foundational benchmark and highlight significant opportunities for future research in data-driven architectural design.
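The benchmark formulation above treats massing generation as conditional generation: a VLM receives an image of the site in its urban context together with programmatic requirements and must return a structured geometry. The sketch below shows one plausible shape for that interface; the `SiteProgram` fields, the prompt wording, and the JSON box schema are illustrative assumptions, not the paper's actual I/O format.

```python
import json
from dataclasses import dataclass

@dataclass
class SiteProgram:
    """Functional requirements for a development site (illustrative fields)."""
    site_area_m2: float
    target_gfa_m2: float   # gross floor area the massing must deliver
    max_height_m: float
    functions: dict        # e.g. {"residential": 0.7, "retail": 0.3}

def build_prompt(program: SiteProgram) -> str:
    """Compose the text half of the conditioning; the site-context image is attached separately."""
    return (
        "Given the attached top-down image of the site in its urban context, "
        f"propose a building massing as JSON boxes. Site area: {program.site_area_m2} m2, "
        f"target GFA: {program.target_gfa_m2} m2, height limit: {program.max_height_m} m, "
        f"functional mix: {json.dumps(program.functions)}. "
        'Reply only with JSON: {"boxes": [{"x": 0, "y": 0, "w": 0, "d": 0, "h": 0}]}'
    )

def parse_massing(vlm_reply: str) -> list[dict]:
    """Parse the model reply into axis-aligned boxes; raises on malformed output."""
    boxes = json.loads(vlm_reply)["boxes"]
    for box in boxes:
        if not all(k in box for k in ("x", "y", "w", "d", "h")):
            raise ValueError(f"incomplete box: {box}")
    return boxes

def implied_gfa(boxes: list[dict], floor_height_m: float = 3.0) -> float:
    """Rough gross floor area implied by the boxes: footprint times floor count."""
    return sum(b["w"] * b["d"] * int(b["h"] // floor_height_m) for b in boxes)
```

A consistency check such as comparing `implied_gfa` against `target_gfa_m2` under the height limit is one plausible way to score outputs, though the paper's own evaluation protocol may differ.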
Related papers
- A Framework for Generating Artificial Datasets to Validate Absolute and Relative Position Concepts [2.0391237204597368]
The framework focuses on fundamental concepts such as object recognition, absolute and relative positions, and attribute identification.
The proposed framework offers a valuable instrument for generating diverse and comprehensive datasets.
arXiv Detail & Related papers (2025-09-17T18:37:24Z)
- From Parameters to Performance: A Data-Driven Study on LLM Structure and Development [73.67759647072519]
Large language models (LLMs) have achieved remarkable success across various domains.
Despite the rapid growth in model scale and capability, systematic, data-driven research on how structural configurations affect performance remains scarce.
We present a large-scale dataset encompassing diverse open-source LLM structures and their performance across multiple benchmarks.
arXiv Detail & Related papers (2025-09-14T12:20:39Z)
- Video Understanding by Design: How Datasets Shape Architectures and Insights [47.846604113207206]
Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures.
This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode.
arXiv Detail & Related papers (2025-09-11T05:06:30Z)
- OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring [4.795391174842949]
The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring.
Despite growing interest in visual datasets, existing resources vary widely in size, quality, and representativeness of real-world construction conditions.
This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development.
arXiv Detail & Related papers (2025-08-15T13:56:21Z)
- Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [89.77871049500546]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture.
This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
- Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models [104.17057231661371]
Time series analysis is crucial for understanding the dynamics of complex systems.
Recent advances in foundation models have led to task-agnostic Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs).
Their success depends on large, diverse, and high-quality datasets, which are challenging to build due to regulatory, diversity, quality, and quantity constraints.
This survey provides a comprehensive review of synthetic data for TSFMs and TSLLMs, analyzing data generation strategies, their role in model pretraining, fine-tuning, and evaluation, and identifying future research directions.
arXiv Detail & Related papers (2025-03-14T13:53:46Z)
- Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models [85.55649666025926]
We introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities.
Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans.
We propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans.
arXiv Detail & Related papers (2024-09-22T00:30:11Z)
- PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation [2.1184929769291294]
This paper presents a novel synthetic dataset designed to evaluate the proficiency of large language models in interpreting data visualizations.
Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios.
We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models.
arXiv Detail & Related papers (2024-09-04T11:19:17Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- T-METASET: Task-Aware Generation of Metamaterial Datasets by Diversity-Based Active Learning [14.668178146934588]
We propose t-METASET: an intelligent data acquisition framework for task-aware dataset generation.
We validate the proposed framework in three hypothetical deployment scenarios, encompassing general use, task-aware use, and tailorable use; a minimal sketch of the diversity-based selection idea appears after this list.
arXiv Detail & Related papers (2022-02-21T22:46:49Z)
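As referenced in the t-METASET entry above, diversity-based active learning for dataset generation can be approximated by greedy farthest-point selection: repeatedly pick the candidate design that is farthest in feature space from everything already chosen. The sketch below illustrates that generic heuristic only; the feature representation, and any task-aware weighting t-METASET applies on top of it, are not modeled here.

```python
import numpy as np

def diverse_subset(candidates: np.ndarray, budget: int) -> list[int]:
    """Greedy farthest-point selection: choose `budget` rows of `candidates`
    (one design's feature vector per row) that spread out in feature space."""
    chosen = [0]  # seed with an arbitrary first design
    # distance from every candidate to its nearest already-chosen design
    dists = np.linalg.norm(candidates - candidates[0], axis=1)
    while len(chosen) < budget:
        nxt = int(np.argmax(dists))          # farthest from the current set
        chosen.append(nxt)
        new_d = np.linalg.norm(candidates - candidates[nxt], axis=1)
        dists = np.minimum(dists, new_d)     # update nearest-chosen distances
    return chosen

# Example: pick 10 diverse designs out of 1,000 random 8-D feature vectors.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 8))
print(diverse_subset(features, budget=10))
```

Swapping the plain Euclidean distance for a task-aware metric is the natural point where a framework like t-METASET would diverge from this pure-diversity baseline.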