Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment
- URL: http://arxiv.org/abs/2511.19537v1
- Date: Mon, 24 Nov 2025 10:26:30 GMT
- Title: Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment
- Authors: Muhao Guo, Yang Weng
- Abstract summary: This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the $Δ$F1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions.
- Score: 5.156484100374059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global coverage, traditional computer vision (CV) models such as CNNs and U-Nets require extensive labeled data and fail to generalize across regions. This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the $Δ$F1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines. These results highlight the robustness of multimodal LLMs under domain shift and their potential for scalable, transferable, and interpretable global PV mapping.
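The abstract reports results in terms of a cross-regional $Δ$F1 metric but does not spell out its exact definition here. A minimal sketch under one plausible reading, namely that $Δ$F1 is the drop in F1 when a model is evaluated on an unseen region instead of its training region (smaller is better), is:

```python
# Hypothetical sketch of the cross-regional Delta-F1 evaluation described in the
# abstract. The exact formula is not given here, so this assumes
# Delta-F1 = F1(in-domain test set) - F1(unseen-region test set); a smaller value
# means less degradation under domain shift. All counts are illustrative only.

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def delta_f1(in_domain: tuple[int, int, int], unseen: tuple[int, int, int]) -> float:
    """Performance drop (in F1) from the training region to an unseen region."""
    return f1_score(*in_domain) - f1_score(*unseen)

# Illustrative (tp, fp, fn) counts for PV detections in each region.
print(delta_f1(in_domain=(90, 10, 12), unseen=(70, 25, 30)))  # larger value = worse transfer
```

Under this reading, comparing $Δ$F1 across models isolates robustness to domain shift from absolute detection accuracy.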
Related papers
- Vision-LLMs for Spatiotemporal Traffic Forecasting [14.700408329373998]
Large Language Models (LLMs) inherently struggle to model the complex spatial dependencies of grid-based traffic data. We propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. We show that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by over 30.04% in cross-domain scenarios.
arXiv Detail & Related papers (2025-10-13T11:15:56Z) - Solar Photovoltaic Assessment with Large Language Model [5.156484100374059]
We investigate how large language models (LLMs) can be leveraged to overcome solar panel detection challenges. LLMs face several challenges in solar panel detection, including difficulties with multi-step logical processes. We propose the PV Assessment with LLMs framework, which incorporates task decomposition for more efficient output standardization.
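Neither this summary nor the main abstract gives the concrete prompts or output format, but a rough sketch of what task decomposition with a standardized output schema could look like for PV assessment follows; the prompt wording, field names, and the `call_llm` helper are illustrative assumptions, not the authors' actual interface.

```python
# Hypothetical sketch: decompose PV assessment of a satellite tile into subtasks and
# merge the answers into one standardized JSON record. Prompts, field names, and the
# call_llm() callable are assumptions for illustration only.
import json

SUBTASK_PROMPTS = [
    "Step 1 (detection): State whether the image contains PV panels.",
    "Step 2 (localization): List a bounding box for each detected panel array.",
    "Step 3 (quantification): Estimate the total panel area from the boxes.",
]

def assess_tile(image_tile, call_llm):
    """Run the decomposed subtasks and merge the answers into one JSON record."""
    answers = [call_llm(prompt, image_tile) for prompt in SUBTASK_PROMPTS]
    record = {
        "detection": answers[0],       # e.g. true/false
        "localization": answers[1],    # e.g. list of pixel boxes
        "quantification": answers[2],  # e.g. estimated panel area
    }
    # Standardized output: every tile yields the same JSON structure, which makes
    # cross-region evaluation (e.g., F1 over detections) straightforward.
    return json.dumps(record)
```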
arXiv Detail & Related papers (2025-07-25T10:26:29Z) - Globalization for Scalable Short-term Load Forecasting [7.654516721062505]
This paper investigates global load forecasting in the presence of data drifts. We show how globalization, data heterogeneity, and data drift each affect forecasting differently. We also examine the role of globalization in peak load forecasting and its potential for hierarchical forecasting.
arXiv Detail & Related papers (2025-07-15T20:58:14Z) - LM-Net: A Light-weight and Multi-scale Network for Medical Image Segmentation [7.963884317408774]
Current medical image segmentation approaches have limitations in deeply exploring multi-scale information. We propose a novel, lightweight, and multi-scale architecture (LM-Net) to enhance segmentation accuracy. Our proposed model achieves state-of-the-art results, surpassing previous methods, while requiring only 4.66G FLOPs and 5.4M parameters.
arXiv Detail & Related papers (2025-01-07T14:47:15Z) - Multisource Collaborative Domain Generalization for Cross-Scene Remote Sensing Image Classification [57.945437355714155]
Cross-scene image classification aims to transfer prior knowledge of ground materials to annotate regions with different distributions. Existing approaches focus on single-source domain generalization to unseen target domains. We propose a novel multi-source collaborative domain generalization framework (MS-CDG) based on the homogeneity and heterogeneity characteristics of multi-source remote sensing data.
arXiv Detail & Related papers (2024-12-05T06:15:08Z) - Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP integrates local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
arXiv Detail & Related papers (2024-10-07T17:16:20Z) - INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception.
First, we introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective.
Second, we introduce a Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z) - WorldGPT: Empowering LLM as Multimodal World Model [51.243464216500975]
We introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM).
WorldGPT acquires an understanding of world dynamics through analyzing millions of videos across various domains.
We conduct evaluations on WorldNet, a multimodal state transition prediction benchmark.
arXiv Detail & Related papers (2024-04-28T14:42:02Z) - ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
LVLMs are often problematic due to their massive computational/energy costs and carbon consumption.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z) - Federated Learning With Quantized Global Model Updates [84.55126371346452]
We study federated learning, which enables mobile devices to utilize their local datasets to train a global model.
We introduce a lossy FL (LFL) algorithm, in which both the global model and the local model updates are quantized before being transmitted.
arXiv Detail & Related papers (2020-06-18T16:55:20Z)
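The summary above only states that the global model and the local updates are quantized before transmission. A minimal sketch of one such lossy scheme, a uniform scalar quantizer that is not necessarily the paper's exact design, is:

```python
# Minimal sketch of lossy quantization of model updates, as one plausible instance of
# the "quantized before being transmitted" step described above. The b-bit uniform
# scalar quantizer used here is an assumption, not necessarily the paper's scheme.
import numpy as np

def quantize(update: np.ndarray, bits: int = 8):
    """Map a float update vector to integer codes plus (scale, offset) metadata."""
    lo, hi = float(update.min()), float(update.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((update - lo) / scale).astype(np.uint16)
    return codes, scale, lo

def dequantize(codes: np.ndarray, scale: float, lo: float) -> np.ndarray:
    """Reconstruct an approximate update on the receiver side."""
    return codes.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
update = rng.normal(size=1000)               # a local model update (illustrative)
codes, scale, lo = quantize(update, bits=8)  # what a device would transmit
recovered = dequantize(codes, scale, lo)
print(np.abs(update - recovered).max())      # quantization error is bounded by ~scale/2
```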
This list is automatically generated from the titles and abstracts of the papers on this site.