Related papers: Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters

URL: http://arxiv.org/abs/2512.09092v1
Date: Tue, 09 Dec 2025 20:10:43 GMT
Title: Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters
Authors: Mizanur Rahman Jewel, Mohamed Elmahallawy, Sanjay Madria, Samuel Frimpong
Abstract summary: Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for humans and conventional systems. We propose MDSE, Multimodal Disaster Situation Explainer, a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes.
Score: 0.6533091401094101
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for humans and conventional systems. To address this, we propose MDSE, Multimodal Disaster Situation Explainer, a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE has three-fold innovations: (i) Context-Aware Cross-Attention for robust alignment of visual and textual features even under severe degradation; (ii) Segmentation-aware dual pathway visual encoding that fuses global and region-specific embeddings; and (iii) Resource-Efficient Transformer-Based Language Model for expressive caption generation with minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset--the first image-caption corpus of real underground disaster scenes--enabling rigorous training and evaluation. Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments, improving situational awareness for underground emergency response. The code is at https://github.com/mizanJewel/Multimodal-Disaster-Situation-Explainer.
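
The abstract names cross-attention between visual and textual features, with global and region-specific embeddings fused on the visual side. As a rough illustration only, the sketch below shows plain scaled dot-product cross-attention in which text queries attend over a set of visual tokens built from one global embedding plus a few region embeddings. It is not the MDSE implementation: the function names, the toy vectors, and the choice to simply concatenate global and region tokens are all assumptions made for this example, and the paper's context-aware variant is not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Each query vector (e.g. a text token) attends over all key/value
    vectors (e.g. visual tokens) and returns a weighted sum of the values.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Hypothetical toy embeddings (assumed for illustration, not from the paper).
global_embedding = [0.2, 0.1, 0.0, 0.3]            # whole-image feature
region_embeddings = [[1.0, 0.0, 0.0, 0.0],         # segmented region 1
                     [0.0, 1.0, 0.0, 0.0]]         # segmented region 2

# Assumed fusion: treat the global and region embeddings as one token set.
visual_tokens = [global_embedding] + region_embeddings

# Two hypothetical text-side query vectors.
text_queries = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 0.5, 0.5, 0.0]]

fused = cross_attention(text_queries, visual_tokens, visual_tokens)
```

Each row of `fused` is a convex combination of the visual tokens, so a text query that resembles a particular region embedding draws most of its output from that region.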

Related papers

Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding [4.918510966192794]
We present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding. We focus on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. The most notable remark across all evaluated benchmarks is that supervised training remains the most reliable approach.
arXiv Detail & Related papers (2026-03-01T23:50:08Z)
When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models [75.16145284285456]
We introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. We develop the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark reveal that even minor multimodal perturbations can cause significant behavioral deviations.
arXiv Detail & Related papers (2025-11-20T10:14:32Z)
Semantics and Content Matter: Towards Multi-Prior Hierarchical Mamba for Image Deraining [95.00432497331583]
Multi-Prior Hierarchical Mamba (MPHM) network for image deraining. MPHM integrates macro-semantic textual priors (CLIP) for task-level semantic guidance and micro-structural visual priors (DINOv2) for scene-aware structural information. Experiments demonstrate MPHM's state-of-the-art performance, achieving a 0.57 dB PSNR gain on the Rain200H dataset.
arXiv Detail & Related papers (2025-11-17T08:08:59Z)
Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations [56.816929931908824]
We pioneer the detection of semantically-coordinated manipulations in multimodal data. We propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. Our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-09-16T04:18:48Z)
RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation [26.836547579041067]
Referring Image Segmentation (RIS) aims to segment specific objects based on natural language descriptions. Existing datasets and methods are typically designed for high-altitude and static-view imagery. We present RIS-LAD, the first fine-grained RIS benchmark tailored for Low-Altitude Drone (LAD) scenarios.
arXiv Detail & Related papers (2025-07-28T15:21:03Z)
MONITRS: Multimodal Observations of Natural Incidents Through Remote Sensing [39.47126465689941]
We present MONITRS, a novel dataset of more than 10,000 FEMA disaster events with temporal satellite imagery and natural language annotations from news articles. We demonstrate that fine-tuning existing MLLMs on our dataset yields significant performance improvements for disaster monitoring tasks.
arXiv Detail & Related papers (2025-07-22T04:59:09Z)
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [54.04601077224252]
Embodied scene understanding requires not only comprehending visual-spatial information but also determining where to explore next in the 3D physical world. 3D vision-language learning enables embodied agents to effectively explore and understand their environment. The model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images.
arXiv Detail & Related papers (2025-07-05T14:15:52Z)
Always Clear Depth: Robust Monocular Depth Estimation under Adverse Weather [48.65180004211851]
We present a robust monocular depth estimation method called ACDepth from the perspective of high-quality training data generation and domain adaptation. Specifically, we introduce a one-step diffusion model for generating samples that simulate adverse weather conditions, constructing a multi-tuple degradation dataset during training. We elaborate on a multi-granularity knowledge distillation strategy (MKD) that encourages the student network to absorb knowledge from both the teacher model and pretrained Depth Anything V2.
arXiv Detail & Related papers (2025-05-18T02:30:47Z)
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding [44.81427860963744]
A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. We propose DenseGrounding, a novel approach designed to enhance both visual and textual semantics. For visual features, we introduce the Hierarchical Scene Semantic Enhancer, which retains dense semantics by capturing fine-grained global scene features. For text descriptions, we propose a Language Semantic Enhancer that leverages large language models to provide rich context and diverse language descriptions.
arXiv Detail & Related papers (2025-05-08T05:49:06Z)
SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments [29.107550321162122]
We present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced vision-language models. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. Our experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs.
arXiv Detail & Related papers (2024-11-28T05:55:13Z)
Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding [54.50661247353241]
3D visual grounding consists of identifying the instance in a 3D scene which is referred by an accompanying language description. Most methods rely on a basic supervised cross-entropy loss on the predicted distribution over candidate instances. We introduce two novel losses for 3D visual grounding: a visual-level offset loss on regressed vector offsets from each instance to the ground-truth referred instance and a language-related span loss on predictions for the word-level span of the referred instance in the description.
arXiv Detail & Related papers (2024-11-05T18:39:25Z)
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations. Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.