FuguReport

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Authors Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Affiliations China Telecom / University of Science and Technology of China / Northwestern Polytechnical University
Categories Task / Referring Segmentation / Multi-temporal image segmentation with language, Method / Baseline Models / LVLM-based baseline evaluation, Evaluation / Benchmarking / Multi-temporal referring segmentation benchmark
License CC BY 4.0

Abstract Overview

This paper introduces Multi-temporal Referring Segmentation (MTRS), a task in which a model receives temporally related images and a natural-language query and must segment the region corresponding to the described temporal change. To support this setting, the authors build MTRefSeg-21K using the CRAFT-Agent pipeline with human auditing, yielding 9,521 bi-temporal image pairs and 20,924 referring expressions with masks across normal-scene and remote-sensing domains. The paper also adapts existing VLM/LVLM segmentation models to this setting and shows that direct use of single-temporal LVLMs is generally ineffective for MTRS. To address the gap, the authors propose MTRefSeg-R1, a change-aware LVLM trained in two stages: vision-only temporal-change pretraining on about 20K bi-temporal samples, followed by language-guided fine-tuning on MTRefSeg-21K.
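The task's inputs and outputs can be pictured with a small data-structure sketch. The summary above does not describe the dataset's file format, so the class name `MTRSSample`, its field names, and the nested-list image representation are all hypothetical. The check also assumes the two images are co-registered at the same resolution, which the source does not state.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins: a 2D grid of pixel values and a 2D binary mask.
Image = List[List[int]]
Mask = List[List[int]]


@dataclass
class MTRSSample:
    """One illustrative MTRS example: two images, a query, a target mask."""
    image_before: Image   # earlier acquisition
    image_after: Image    # later acquisition
    expression: str       # natural-language description of the change
    change_mask: Mask     # 1 where the described change occurs, else 0


def is_consistent(sample: MTRSSample) -> bool:
    """Check that both images and the mask share one shape (an assumption)."""
    if not sample.image_before or not sample.image_before[0]:
        return False
    height = len(sample.image_before)
    width = len(sample.image_before[0])

    def shape_ok(grid: List[List[int]]) -> bool:
        return len(grid) == height and all(len(row) == width for row in grid)

    return (
        shape_ok(sample.image_before)
        and shape_ok(sample.image_after)
        and shape_ok(sample.change_mask)
        and bool(sample.expression.strip())
    )


example = MTRSSample(
    image_before=[[0, 0], [0, 0]],
    image_after=[[0, 9], [0, 0]],
    expression="the new building in the upper right",
    change_mask=[[0, 1], [0, 0]],
)
```

A model for this task would map the first three fields to a prediction of the fourth. One image pair may carry several expressions, each with its own mask. This matches the reported counts of 20,924 expressions over 9,521 pairs, roughly 2.2 expressions per pair.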

Novelty

The main novelty is the formalization of MTRS as a new task that combines temporal correspondence reasoning, language grounding, and pixel-level segmentation. The work also contributes the first large-scale benchmark for this task, MTRefSeg-21K, and a change-aware LVLM baseline, MTRefSeg-R1, built around explicit temporal fusion and two-stage training.

Results

Experiments show that existing single-temporal LVLMs perform poorly under direct inference. Fine-tuning them on MTRefSeg-21K substantially improves results, but the fine-tuned models still trail the proposed model. MTRefSeg-R1 achieves the best reported mean performance across the main settings, with 65.68 mIoU and 71.65 Pr@50 averaged over the benchmark, and reaches 68.24 mIoU on Train→Val and 68.92 mIoU on the RS→RS setting. Ablations further indicate that stage-1 visual change pretraining and full fine-tuning improve performance over end-to-end multimodal pretraining or LoRA-based adaptation.
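For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute them for binary masks. It is an assumption about the definitions, not the paper's evaluation code. It takes mIoU as the mean of per-sample IoU and Pr@50 as the share of samples with IoU of at least 0.5, both as percentages. The function names and the convention for two empty masks are illustrative choices.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # 2D binary mask, 1 = predicted/annotated change


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def miou_and_pr(
    preds: List[Mask], gts: List[Mask], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (mIoU, Pr@threshold) as percentages over paired masks."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and equal in length")
    ious = [mask_iou(p, g) for p, g in zip(preds, gts)]
    miou = 100.0 * sum(ious) / len(ious)
    precision = 100.0 * sum(1 for v in ious if v >= threshold) / len(ious)
    return miou, precision


# Two toy samples: one exact match, one complete miss.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 1], [0, 0]], [[0, 1], [0, 0]]]
miou, pr50 = miou_and_pr(preds, gts)
```

Some benchmarks instead report a cumulative IoU, which pools intersections and unions over the whole dataset, or use a strict inequality at the threshold. The figures above should be read against whichever convention the paper specifies.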

Key Points

  1. MTRefSeg-21K provides a multi-domain MTRS benchmark with 9,521 image pairs and 20,924 language-grounded change masks spanning normal-scene and remote-sensing imagery.
  2. Benchmarking shows that single-temporal LVLMs transfer poorly to language-guided temporal change segmentation without task-specific adaptation.
  3. MTRefSeg-R1 combines explicit change-aware temporal fusion with two-stage training and delivers the strongest reported LVLM-based performance across overall and remote-sensing evaluations.
