TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting
- URL: http://arxiv.org/abs/2506.18862v2
- Date: Fri, 26 Sep 2025 17:35:39 GMT
- Title: TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting
- Authors: Zhongbin Guo, Yuhao Wang, Ping Jian, Chengzhi Li, Xinyue Chen, Zhen Yang, Ertai E
- Abstract summary: We introduce TAMMs, the first unified framework designed to jointly perform TCD and FSIF within a single MLLM-diffusion architecture. TAMMs introduces two key innovations: Temporal Adaptation Modules (TAM) enhance the frozen MLLM's ability to comprehend long-range dynamics, and a Semantic-Fused Control Injection (SFCI) mechanism translates this change understanding into fine-grained generative control. Extensive experiments demonstrate TAMMs significantly outperforms state-of-the-art specialist baselines on both tasks.
- Score: 22.01157165112828
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal Change Description (TCD) and Future Satellite Image Forecasting (FSIF) are critical, yet historically disjoint tasks in Satellite Image Time Series (SITS) analysis. Both are fundamentally limited by the common challenge of modeling long-range temporal dynamics. To explore how enhancing long-range temporal understanding can improve performance on both tasks simultaneously, we introduce TAMMs, the first unified framework designed to jointly perform TCD and FSIF within a single MLLM-diffusion architecture. TAMMs introduces two key innovations: Temporal Adaptation Modules (TAM) enhance the frozen MLLM's ability to comprehend long-range dynamics, and a Semantic-Fused Control Injection (SFCI) mechanism translates this change understanding into fine-grained generative control. This synergistic design allows the understanding gained from the TCD task to directly inform and improve the consistency of the FSIF task. Extensive experiments demonstrate that TAMMs significantly outperforms state-of-the-art specialist baselines on both tasks.
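The abstract does not include implementation details, but the adapter idea behind the Temporal Adaptation Modules can be illustrated roughly: a small trainable temporal self-attention block attached to a frozen backbone, attending across the time axis of a satellite image sequence. The class name, dimensions, and zero-init trick below are illustrative assumptions, not the paper's released architecture.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Hypothetical sketch of a temporal adapter for a frozen MLLM.

    Attends across the time axis of per-timestep embeddings and adds a
    residual correction. Only adapter parameters would be trained; the
    backbone stays frozen.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-init the output projection so the module starts as identity
        # and cannot disturb the frozen backbone's features at step 0.
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -- one embedding per timestep in the series
        h = self.norm(x)
        h, _ = self.attn(h, h, h)          # temporal self-attention
        return x + self.proj(h)            # residual; identity at init

# Toy usage: 2 sequences of 8 timesteps with 64-dim features.
adapter = TemporalAdapter(dim=64)
x = torch.randn(2, 8, 64)
y = adapter(x)
print(tuple(y.shape))  # (2, 8, 64)
```

Because the projection is zero-initialized, `y` equals `x` before any training, which is a common way to insert new modules into pretrained models without degrading them at the start.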
Related papers
- Temporal Consistency-Aware Text-to-Motion Generation [41.71400323450202]
We propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE for cross-sequence temporal alignment. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-20T08:17:01Z) - UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z) - AR-MOT: Autoregressive Multi-object Tracking [56.09738000988466]
We propose a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector.
arXiv Detail & Related papers (2026-01-05T09:17:28Z) - TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning [25.848638804759872]
Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis. We present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension.
arXiv Detail & Related papers (2025-12-03T16:57:00Z) - FAIM: Frequency-Aware Interactive Mamba for Time Series Classification [87.84511960413715]
Time series classification (TSC) is crucial in numerous real-world applications, such as environmental monitoring, medical diagnosis, and posture recognition. We propose FAIM, a lightweight Frequency-Aware Interactive Mamba model. We show that FAIM consistently outperforms existing state-of-the-art (SOTA) methods, achieving a superior trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2025-11-26T08:36:33Z) - MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection [94.12444452690329]
This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities. MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.
arXiv Detail & Related papers (2025-11-22T06:04:29Z) - Foundation Model for Skeleton-Based Human Action Understanding [56.89025287217221]
This paper presents a Unified Skeleton-based Dense Representation Learning framework. USDRL consists of a Transformer-based Dense Spatio-Temporal (DSTE) encoder, Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT).
arXiv Detail & Related papers (2025-08-18T02:42:16Z) - ME-TST+: Micro-expression Analysis via Temporal State Transition with ROI Relationship Awareness [12.584801819076425]
Micro-expressions (MEs) are regarded as important indicators of an individual's intrinsic emotions, preferences, and tendencies. Previous deep learning approaches commonly employ sliding-window classification networks. This paper proposes two state space model-based architectures, namely ME-TST and ME-TST+.
arXiv Detail & Related papers (2025-08-11T15:28:32Z) - Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting [12.591771385493509]
We present ST-VFM, a framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose spatio-temporal forecasting. The framework integrates raw inputs with an auxiliary ST flow, where the flow encodes lightweight temporal difference signals interpretable as dynamic cues. The pre-VFM reprogramming applies a Temporal-Aware Token to align both branches into VFM-compatible feature spaces. The post-VFM reprogramming introduces a Bilateral Cross-Prompt Coordination module, enabling dynamic interaction between branches.
arXiv Detail & Related papers (2025-07-14T08:33:34Z) - DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs [5.074812070492738]
We introduce DaMO, a data-efficient Video LLM specifically designed for accurate temporal reasoning and multimodal understanding. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. Our work establishes a promising direction for data-efficient video-language modeling.
arXiv Detail & Related papers (2025-06-13T08:13:05Z) - FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities [76.46448367752944]
Multimodal large language models (MLLMs) unify visual understanding and image generation within a single framework. Most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development. We introduce FUDOKI, a unified multimodal model purely based on discrete flow matching.
arXiv Detail & Related papers (2025-05-26T15:46:53Z) - DG-STMTL: A Novel Graph Convolutional Network for Multi-Task Spatio-Temporal Traffic Forecasting [0.0]
A key challenge for accurate prediction is how to model the complex spatio-temporal dependencies and adapt to the inherent dynamics in data. Traditional Graph Convolutional Networks (GCNs) often struggle with static adjacency matrices, which introduce bias, or with purely learnable patterns. This study introduces a novel MTL framework, Dynamic Group-wise Spatio-Temporal Multi-Task Learning (DG-STMTL).
arXiv Detail & Related papers (2025-04-10T15:00:20Z) - UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce UniSTD, a unified Transformer-based framework for spatio-temporal modeling. Our work demonstrates that a task-specific vision-text model can build a generalizable model for spatio-temporal learning. We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z) - LLM-PS: Empowering Large Language Models for Time Series Forecasting with Temporal Patterns and Semantics [56.99021951927683]
Time Series Forecasting (TSF) is critical in many real-world domains like financial planning and health monitoring. Existing Large Language Models (LLMs) usually perform suboptimally because they neglect the inherent characteristics of time series data. We propose LLM-PS to empower the LLM for TSF by learning the fundamental Patterns and meaningful Semantics from time series data.
arXiv Detail & Related papers (2025-03-12T11:45:11Z) - Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding [23.477954901326978]
Existing approaches adopt either implicit temporal modeling, relying solely on the decoder, or explicit temporal modeling, employing auxiliary temporal encoders. We propose the Stackable Temporal Encoder (STE) to enable flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios. Our findings emphasize the critical role of explicit temporal modeling, providing actionable insights to advance video MLLMs.
arXiv Detail & Related papers (2025-01-28T08:30:58Z) - Multimodal Large Models Are Effective Action Anticipators [10.454791411515812]
ActionLLM is a novel approach that treats video sequences as successive tokens, leveraging Large Language Models to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding.
arXiv Detail & Related papers (2025-01-01T10:16:10Z) - Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models [44.99833362998488]
Temporal Semantic Alignment via Dynamic Prompting (TSADP) is a novel framework that enhances temporal reasoning capabilities. We evaluate TSADP on the VidSitu dataset, augmented with enriched temporal annotations. Our analysis highlights the robustness, efficiency, and practical utility of TSADP, making it a step forward in the field of video-language understanding.
arXiv Detail & Related papers (2024-12-16T02:37:58Z) - SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding. Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z) - SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation [92.73405185996315]
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation. Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering. We introduce a model-agnostic iterative self-feedback framework (SILMM) that can enable LMMs to provide helpful and scalable self-improvement and optimize text-image alignment.
arXiv Detail & Related papers (2024-12-08T05:28:08Z) - Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models [33.37379526356273]
We introduce a novel learning paradigm termed MLLM4WTAL. It harnesses the potential of MLLM to offer temporal action key semantics and complete semantic priors. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR).
arXiv Detail & Related papers (2024-11-13T09:37:24Z) - MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z) - Multi-Patch Prediction: Adapting LLMs for Time Series Representation Learning [22.28251586213348]
aLLM4TS is an innovative framework that adapts Large Language Models (LLMs) for time-series representation learning.
A distinctive element of our framework is the patch-wise decoding layer, which departs from previous methods reliant on sequence-level decoding.
arXiv Detail & Related papers (2024-02-07T13:51:26Z) - Making LLaMA SEE and Draw with SEED Tokenizer [69.1083058794092]
We introduce SEED, an elaborate image tokenizer that empowers Large Language Models with the ability to SEE and Draw.
With SEED tokens, LLM is able to perform scalable multimodal autoregression under its original training recipe.
SEED-LLaMA has exhibited compositional emergent abilities such as multi-turn in-context multimodal generation.
arXiv Detail & Related papers (2023-10-02T14:03:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.