Fugu-MT Paper Translation (Abstract): Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

Paper overview: Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

arxiv url: http://arxiv.org/abs/2604.21102v1
Date: Wed, 22 Apr 2026 21:42:09 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.191506
Title: Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery
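Title (reference translation): Leveraging Multimodal LLMs in the Built Environment and Housing Attribute Assessment from Street-View Imagery
Authors: Siyuan Yao, Siavash Ghorbany, Kuangshi Ai, Arnav Cherukuthota, Meghan Forstchen, Alexis Korotasz, Matthew Sisk, Ming Hu, Chaoli Wang
Abstract summary: This paper proposes a new framework that automatically evaluates building conditions across the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. The proposed approach achieves strong alignment with human mean opinion scores (MOS) on SRCC and PLCC, outperforming even individual raters. The framework offers a flexible and efficient solution for large-scale building condition assessment, achieving high accuracy with minimal human labeling effort.
Reference score (attention score computed independently by this site): 11.903829789742725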
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
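Abstract (reference translation): This paper proposes a new framework that automatically evaluates building conditions across the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, the approach achieves strong alignment with human mean opinion scores (MOS) and outperforms even individual raters on SRCC and PLCC relative to the MOS benchmark. To improve efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. We further distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), which deliver close performance with a 30x speed gain. In addition, through a human-AI alignment study, we examine how well LLMs can assess an extensive list of built environment and housing attributes, and we develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. The framework offers a flexible and efficient solution for large-scale building condition assessment, achieving high accuracy with minimal human labeling effort.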
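The abstract reports agreement with the mean opinion score (MOS) in terms of SRCC (Spearman rank correlation) and PLCC (Pearson linear correlation). As a rough illustration of how such figures are typically computed, here is a minimal Python sketch. The ratings and model scores are made-up placeholder values, and the helper names (`pearson`, `average_ranks`, `spearman`) are assumptions for this example, not the paper's code or data.

```python
from statistics import mean


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one row per human rater, one column per image.
ratings = [
    [1, 3, 4, 2, 5],
    [2, 3, 5, 1, 4],
    [1, 2, 4, 3, 5],
]
# MOS: the mean rating each image received across raters.
mos = [mean(col) for col in zip(*ratings)]
# Hypothetical model predictions for the same five images.
model_scores = [1.2, 2.9, 4.4, 1.8, 4.6]

print(f"SRCC={spearman(model_scores, mos):.3f}")
print(f"PLCC={pearson(model_scores, mos):.3f}")
```

An individual rater can be scored the same way by passing one row of `ratings` in place of `model_scores`, which is the kind of comparison the abstract describes. Whether the authors exclude a rater's own scores from the MOS in that comparison is not stated in the abstract.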

Related papers