An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation
Abstract Overview
This paper introduces Multi-temporal Referring Segmentation (MTRS), a task in which a model receives temporally related images and a natural-language query and must segment the region corresponding to the described temporal change. To support this setting, the authors build MTRefSeg-21K using the CRAFT-Agent pipeline with human auditing, yielding 9,521 bi-temporal image pairs and 20,924 referring expressions with masks across normal-scene and remote-sensing domains. The paper also adapts existing VLM/LVLM segmentation models to this setting and shows that direct use of single-temporal LVLMs is generally ineffective for MTRS. To address the gap, the authors propose MTRefSeg-R1, a change-aware LVLM trained in two stages: vision-only temporal-change pretraining on about 20K bi-temporal samples, followed by language-guided fine-tuning on MTRefSeg-21K.
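To make the task interface concrete, the following is a minimal sketch of what one MTRS sample might look like, assuming a bi-temporal pair stored as two image paths, one referring expression, and a binary mask held as rows of 0/1. The class name `MTRSSample` and all field names are hypothetical and are not taken from the paper or its released code; only the two dataset counts come from the summary above.

```python
from dataclasses import dataclass
from typing import List

Mask = List[List[int]]  # binary mask stored as rows of 0/1 (assumed layout)


@dataclass
class MTRSSample:
    """One hypothetical MTRS sample: two images, one query, one target mask."""
    image_t1: str    # path to the earlier image (hypothetical field name)
    image_t2: str    # path to the later image (hypothetical field name)
    expression: str  # natural-language description of the change
    mask: Mask       # pixels belonging to the described change

    def changed_pixels(self) -> int:
        """Count the pixels marked as changed in the target mask."""
        return sum(sum(row) for row in self.mask)


# Dataset counts as reported in the summary above.
NUM_PAIRS = 9_521
NUM_EXPRESSIONS = 20_924

# Average number of referring expressions per bi-temporal pair.
EXPRESSIONS_PER_PAIR = NUM_EXPRESSIONS / NUM_PAIRS
```

Under these counts, each image pair carries a little over two referring expressions on average, so a single pair typically yields several distinct (query, mask) training samples.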
Novelty
The main novelty is the formalization of MTRS as a new task that combines temporal correspondence reasoning, language grounding, and pixel-level segmentation. The work also contributes the first large-scale benchmark for this task, MTRefSeg-21K, and a change-aware LVLM baseline, MTRefSeg-R1, built around explicit temporal fusion and two-stage training.
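The two-stage recipe can be outlined as a small schedule in which the weights from stage 1 initialize stage 2. This is only a structural sketch: the `Stage` class, the stage names, and the `train_stage` callback are hypothetical placeholders, and the real optimizer, losses, and model architecture are not described in this summary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class Stage:
    name: str            # hypothetical label for the stage
    data: str            # training data, as described in the summary
    uses_language: bool  # whether language queries are part of the input


# Stage order and data sources follow the summary; the names are illustrative.
SCHEDULE = (
    Stage("stage1_visual_change_pretraining", "about 20K bi-temporal samples", False),
    Stage("stage2_language_guided_finetuning", "MTRefSeg-21K", True),
)


def run_schedule(
    schedule: Sequence[Stage],
    train_stage: Callable[[Stage, Any], Any],
    init_weights: Any,
) -> Any:
    """Run the stages in order, feeding each stage's weights into the next."""
    weights = init_weights
    for stage in schedule:
        weights = train_stage(stage, weights)
    return weights
```

A caller would supply its own `train_stage` function; the point of the sketch is only that stage 2 starts from the change-aware weights produced by stage 1 rather than from the original pretrained model.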
Results
Experiments show that existing single-temporal LVLMs perform poorly under direct inference. Fine-tuning them on MTRefSeg-21K substantially improves their results, but they still trail the proposed model. MTRefSeg-R1 achieves the best reported mean performance across the main settings, with 65.68 mIoU and 71.65 Pr@50 on the averaged benchmark, and reaches 68.24 mIoU on Train→Val and 68.92 mIoU on the RS→RS setting. Ablations further indicate that stage-1 visual change pretraining outperforms end-to-end multimodal pretraining and that full fine-tuning outperforms LoRA-based adaptation.
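For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It assumes that mIoU is the mean of per-sample IoU scores and that Pr@50 is the percentage of samples whose IoU is at least 0.5. The exact evaluation protocol (for example, whether the threshold is inclusive, or how two empty masks are scored) is not stated in this summary, so those choices are assumptions, and the function names are illustrative.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask stored as rows of 0/1


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-sample IoU, reported as a percentage."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return 100.0 * sum(scores) / len(scores)


def precision_at(preds: Sequence[Mask], targets: Sequence[Mask],
                 threshold: float = 0.5) -> float:
    """Percentage of samples with IoU >= threshold (Pr@50 when threshold=0.5)."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    hits = sum(1 for s in scores if s >= threshold)  # inclusive threshold is an assumption
    return 100.0 * hits / len(scores)
```

Because Pr@50 counts only samples that clear the threshold while mIoU averages all overlaps, a model can score higher on one than the other; reporting both, as the paper does, gives a fuller picture of segmentation quality.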
Key Points
- MTRefSeg-21K provides a multi-domain MTRS benchmark with 9,521 image pairs and 20,924 language-grounded change masks spanning normal-scene and remote-sensing imagery.
- Benchmarking shows that single-temporal LVLMs transfer poorly to language-guided temporal change segmentation without task-specific adaptation.
- MTRefSeg-R1 combines explicit change-aware temporal fusion with two-stage training and delivers the strongest reported LVLM-based performance in both the overall and the remote-sensing evaluations.