Radio Astronomy in the Era of Vision-Language Models: Prompt Sensitivity and Adaptation
- URL: http://arxiv.org/abs/2509.02615v1
- Date: Sun, 31 Aug 2025 14:31:47 GMT
- Title: Radio Astronomy in the Era of Vision-Language Models: Prompt Sensitivity and Adaptation
- Authors: Mariia Drozdova, Erica Lastufka, Vitaliy Kinakh, Taras Holotyak, Daniel Schaerer, Slava Voloshynovskiy
- Abstract summary: Vision-Language Models (VLMs) are positioned as general-purpose AI systems capable of reasoning across domains. We assess whether generic VLMs, presumed to lack exposure to astronomical corpora, can perform morphology-based classification of radio galaxies.
- Score: 5.711705587813085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs), such as recent Qwen and Gemini models, are positioned as general-purpose AI systems capable of reasoning across domains. Yet their capabilities in scientific imaging, especially on unfamiliar and potentially previously unseen data distributions, remain poorly understood. In this work, we assess whether generic VLMs, presumed to lack exposure to astronomical corpora, can perform morphology-based classification of radio galaxies using the MiraBest FR-I/FR-II dataset. We explore prompting strategies using natural language and schematic diagrams, and, to the best of our knowledge, we are the first to introduce visual in-context examples within prompts in astronomy. Additionally, we evaluate lightweight supervised adaptation via LoRA fine-tuning. Our findings reveal three trends: (i) even prompt-based approaches can achieve good performance, suggesting that VLMs encode useful priors for unfamiliar scientific domains; (ii) however, outputs are highly unstable, i.e., varying sharply with superficial prompt changes such as layout, ordering, or decoding temperature, even when semantic content is held constant; and (iii) with just 15M trainable parameters and no astronomy-specific pretraining, fine-tuned Qwen-VL achieves near state-of-the-art performance (3% error rate), rivaling domain-specific models. These results suggest that the apparent "reasoning" of VLMs often reflects prompt sensitivity rather than genuine inference, raising caution for their use in scientific domains. At the same time, with minimal adaptation, generic VLMs can rival specialized models, offering a promising but fragile tool for scientific discovery.
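As a concrete illustration of the adaptation route mentioned in the abstract, below is a minimal sketch of LoRA fine-tuning for a Qwen-family VLM using the Hugging Face transformers and peft libraries. The checkpoint name, rank, and target modules are illustrative assumptions; the abstract does not state the exact configuration behind the reported 15M trainable parameters.

```python
# Minimal LoRA-adaptation sketch for a Qwen-family VLM.
# Assumptions: the checkpoint, rank, and target modules below are
# illustrative; the paper reports ~15M trainable parameters but the
# abstract does not give the exact setup.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # hypothetical checkpoint choice
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

lora_cfg = LoraConfig(
    r=8,                                  # low-rank update dimension
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # base weights stay frozen
model.print_trainable_parameters()       # verify the small trainable budget
```

Training would then proceed with a standard supervised objective over (image, label-text) pairs from MiraBest, with gradients flowing only through the low-rank adapter weights.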
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to an inherent architectural mismatch.
We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone.
To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment.
With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
- Life, Machine Learning, and the Search for Habitability: Predicting Biosignature Fluxes for the Habitable Worlds Observatory [0.0]
We introduce two advanced machine-learning architectures tailored for predicting biosignature species from exoplanetary reflected-light spectra.
We demonstrate that both models achieve comparably high predictive accuracy on an augmented dataset spanning a wide range of exoplanetary conditions.
arXiv Detail & Related papers (2026-01-18T19:43:48Z)
- Simulation-Based Pretraining and Domain Adaptation for Astronomical Time Series with Minimal Labeled Data [0.12744523252873352]
We present a pre-training approach that leverages simulations, significantly reducing the need for labeled examples from real observations.
Our models, trained on simulated data from multiple astronomical surveys (ZTF and LSST), learn generalizable representations that transfer effectively to downstream tasks.
Remarkably, our models exhibit effective zero-shot transfer capabilities, achieving comparable performance on future telescope (LSST) simulations when trained solely on existing telescope (ZTF) data.
arXiv Detail & Related papers (2025-10-14T20:07:14Z)
- Textual interpretation of transient image classifications from large language models [0.0]
Large language models (LLMs) can approach the performance level of a convolutional neural network on three optical transient survey datasets.
Google's LLM, Gemini, achieves a 93% average accuracy across datasets that span a range of resolutions and pixel scales.
arXiv Detail & Related papers (2025-10-08T12:12:46Z)
- Fine-Tuning Vision-Language Models for Neutrino Event Analysis in High-Energy Physics Experiments [41.33501105382656]
We evaluate a Vision-Language Model (VLM) for classifying neutrino interactions from pixelated detector images in high-energy physics experiments.
We benchmark its performance against an established CNN baseline used in experiments like NOvA and DUNE, evaluating metrics such as classification accuracy, precision, recall, and AUC-ROC.
Our results show that the VLM not only matches or exceeds CNN performance but also enables richer reasoning and better integration of auxiliary textual or semantic context.
arXiv Detail & Related papers (2025-08-26T19:12:28Z)
- SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications [63.92604046592333]
Video foundation models (ViFMs) hold considerable promise as general-purpose, domain-agnostic approaches.
We introduce SciVid, a benchmark comprising five tasks across medical computer vision, animal behavior, and weather forecasting.
We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning.
arXiv Detail & Related papers (2025-07-04T13:48:12Z)
- Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions.
Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o, and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z)
- Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [67.31811007549489]
We propose a Rewriting-driven AugMentation (RAM) paradigm for Vision-Language Navigation (VLN).
Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization.
Experiments on both discrete environments (R2R, REVERIE, and R4R) and continuous environments (R2R-CE) show the superior performance and impressive generalization ability of our method.
arXiv Detail & Related papers (2025-03-23T13:18:17Z)
- Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision-language model (VLM) framework to enhance feature extraction, scalability, and efficiency.
We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise.
Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
arXiv Detail & Related papers (2025-03-08T01:22:10Z)
- Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment.
By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z)
- At First Sight: Zero-Shot Classification of Astronomical Images with Large Multimodal Models [0.0]
Vision-language multimodal models (VLMs) offer the possibility of zero-shot classification in astronomy.
We investigate two models, GPT-4o and LLaVA-NeXT, for zero-shot classification of low-surface-brightness galaxies and artifacts.
We show that with natural language prompts these models achieved significant accuracy (typically above 80 percent) without additional training or fine-tuning.
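The entry above relies on plain natural-language prompting. As an illustration only, here is a minimal zero-shot classification query of the kind such models accept, written against the OpenAI Python client; the prompt wording, label set, and file name are assumptions, not the authors' exact protocol.

```python
# Minimal zero-shot VLM classification sketch (illustrative prompt only,
# not the exact protocol of the paper summarized above).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("galaxy_cutout.png", "rb") as f:  # hypothetical image file
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,  # deterministic decoding for reproducible labels
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Classify this astronomical image as either "
                     "'low-surface-brightness galaxy' or 'artifact'. "
                     "Answer with the label only."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```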
arXiv Detail & Related papers (2024-06-24T18:17:54Z)
- Personalized Adapter for Large Meteorology Model on Devices: Towards Weather Foundation Models [36.229082478423585]
LM-Weather is a generic approach to taming pre-trained language models (PLMs).
We introduce a lightweight personalized adapter into PLMs and endow it with weather pattern awareness.
Experiments show LM-Weather outperforms the state-of-the-art results by a large margin across various tasks.
arXiv Detail & Related papers (2024-05-24T15:25:09Z)
- Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms.
We propose Domain-Controlled Prompt Learning for specific domains.
Our method achieves state-of-the-art performance on domain-specific image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.