Related papers: TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting

URL: http://arxiv.org/abs/2506.18862v1
Date: Mon, 23 Jun 2025 17:26:16 GMT
Title: TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting
Authors: Zhongbin Guo, Yuhao Wang, Ping Jian, Xinyue Chen, Wei Peng, Ertai E
Abstract summary: We study the capabilities of multimodal large language models (MLLMs) on a novel task that jointly targets temporal change understanding and future scene generation. We propose TAMMs, a Temporal-Aware Multimodal Model for satellite image understanding and forecasting.
Score: 8.914172086217185
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Satellite image time-series analysis demands fine-grained spatial-temporal reasoning, which remains a challenge for existing multimodal large language models (MLLMs). In this work, we study the capabilities of MLLMs on a novel task that jointly targets temporal change understanding and future scene generation, aiming to assess their potential for modeling complex multimodal dynamics over time. We propose TAMMs, a Temporal-Aware Multimodal Model for satellite image change understanding and forecasting, which enhances frozen MLLMs with lightweight temporal modules for structured sequence encoding and contextual prompting. To guide future image generation, TAMMs introduces a Semantic-Fused Control Injection (SFCI) mechanism that adaptively combines high-level semantic reasoning and structural priors within an enhanced ControlNet. This dual-path conditioning enables temporally consistent and semantically grounded image synthesis. Experiments demonstrate that TAMMs outperforms strong MLLM baselines in both temporal change understanding and future image forecasting tasks, highlighting how carefully designed temporal reasoning and semantic fusion can unlock the full potential of MLLMs for spatio-temporal understanding.

Related papers

Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting [12.591771385493509]
We present ST-VFM, a framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose spatio-temporal forecasting. The framework integrates raw inputs with an auxiliary ST flow, where the flow encodes lightweight temporal difference signals interpretable as dynamic cues. The pre-VFM reprogramming applies a Temporal-Aware Token to align both branches into VFM-compatible feature spaces. The post-VFM reprogramming introduces a Bilateral CrossPrompt Coordination module, enabling dynamic interaction between branches.
arXiv Detail & Related papers (2025-07-14T08:33:34Z)
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs [5.074812070492738]
We introduce DaMO, a data-efficient Video LLM specifically designed for accurate temporal reasoning and multimodal understanding. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. Our work establishes a promising direction for data-efficient video-language modeling.
arXiv Detail & Related papers (2025-06-13T08:13:05Z)
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities [76.46448367752944]
Multimodal large language models (MLLMs) unify visual understanding and image generation within a single framework. Most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development. We introduce FUDOKI, a unified multimodal model purely based on discrete flow matching.
arXiv Detail & Related papers (2025-05-26T15:46:53Z)
LLM-PS: Empowering Large Language Models for Time Series Forecasting with Temporal Patterns and Semantics [56.99021951927683]
Time Series Forecasting (TSF) is critical in many real-world domains like financial planning and health monitoring. Existing Large Language Models (LLMs) usually perform suboptimally because they neglect the inherent characteristics of time series data. We propose LLM-PS to empower the LLM for TSF by learning the fundamental Patterns and meaningful Semantics from time series data.
arXiv Detail & Related papers (2025-03-12T11:45:11Z)
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding [23.477954901326978]
Existing approaches adopt either implicit temporal modeling, relying solely on the decoder, or explicit temporal modeling, employing auxiliary temporal encoders. We propose the Stackable Temporal Encoder (STE) to enable flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios. Our findings emphasize the critical role of explicit temporal modeling, providing actionable insights to advance video MLLMs.
arXiv Detail & Related papers (2025-01-28T08:30:58Z)
Multimodal Large Models Are Effective Action Anticipators [10.454791411515812]
ActionLLM is a novel approach that treats video sequences as successive tokens, leveraging Large Language Models to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding.
arXiv Detail & Related papers (2025-01-01T10:16:10Z)
Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models [44.99833362998488]
Temporal Semantic Alignment via Dynamic Prompting (TSADP) is a novel framework that enhances temporal reasoning capabilities. We evaluate TSADP on the VidSitu dataset, augmented with enriched temporal annotations. Our analysis highlights the robustness, efficiency, and practical utility of TSADP, making it a step forward in the field of video-language understanding.
arXiv Detail & Related papers (2024-12-16T02:37:58Z)
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding. Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z)
SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation [92.73405185996315]
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation. Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering. We introduce a model-agnostic iterative self-feedback framework (SILMM) that can enable LMMs to provide helpful and scalable self-improvement and optimize text-image alignment.
arXiv Detail & Related papers (2024-12-08T05:28:08Z)
Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models [33.37379526356273]
We introduce a novel learning paradigm termed MLLM4WTAL. It harnesses the potential of MLLM to offer temporal action key semantics and complete semantic priors. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR).
arXiv Detail & Related papers (2024-11-13T09:37:24Z)
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
Multi-Patch Prediction: Adapting LLMs for Time Series Representation Learning [22.28251586213348]
aLLM4TS is an innovative framework that adapts Large Language Models (LLMs) for time-series representation learning. A distinctive element of our framework is the patch-wise decoding layer, which departs from previous methods reliant on sequence-level decoding.
arXiv Detail & Related papers (2024-02-07T13:51:26Z)
Making LLaMA SEE and Draw with SEED Tokenizer [69.1083058794092]
We introduce SEED, an elaborate image tokenizer that empowers Large Language Models with the ability to SEE and Draw. With SEED tokens, LLM is able to perform scalable multimodal autoregression under its original training recipe. SEED-LLaMA has exhibited compositional emergent abilities such as multi-turn in-context multimodal generation.
arXiv Detail & Related papers (2023-10-02T14:03:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.