Fugu-MT Paper Translation (Abstract): Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection

Paper abstract: Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection

arxiv url: http://arxiv.org/abs/2605.07178v1
Date: Fri, 08 May 2026 03:16:46 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.770653
Title: Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection
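Title (reference translation): Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection
Authors: Kai Zheng, Hang-Cheng Dong, Jiatong Pan, Zhenkai Wu, Fupeng Wei, Wei Zhang
Abstract summary: We propose S2M, a framework that obtains structured textual features directly from change labels. S2M achieves a Sek of 17.80% and an F$_{\text{scd}}$ of 66.14%.
Reference score (attention score computed by this site): 5.090262478249704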
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels at zero additional annotation cost. Specifically, each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into several fixed-template text descriptions, providing precise, dense, and noise-free multimodal supervision. We adopt a two-stage training strategy: the model is first fine-tuned on remote sensing imagery to learn a robust domain-specific representation, after which a multimodal decoder with a bi-directional contrastive loss is introduced to achieve deep alignment between visual features and structured textual embeddings. To validate our method, we construct Gaza-Change-v2, a new multi-class change detection (MCD) dataset covering the Gaza Strip. On this MCD dataset, S2M achieves a Sek of 17.80% and an F$_{\text{scd}}$ of 66.14%, notably surpassing even multimodal methods that leverage large language models. Our work demonstrates that masks can indeed talk: they tell us exactly what, where, how, and how many changes have occurred.
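The mask-to-text step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the 3x3 location grid, the transition vocabulary ("removed", "appeared", "converted"), and the sentence template are all hypothetical choices made for illustration, and the abstract does not say how S2M defines any of them.

```python
from collections import deque

# Hypothetical label set; the abstract does not list the dataset's classes.
CLASS_NAMES = {0: "background", 1: "building", 2: "road", 3: "vegetation"}


def position_word(row, col, height, width):
    """Map a centroid to a coarse 3x3 grid location (an assumed scheme)."""
    vertical = ("top", "middle", "bottom")[min(int(3 * row / height), 2)]
    horizontal = ("left", "center", "right")[min(int(3 * col / width), 2)]
    if (vertical, horizontal) == ("middle", "center"):
        return "center"
    return f"{vertical}-{horizontal}"


def transition_word(src, dst):
    """Assumed vocabulary for 'how' the land cover changed."""
    if dst == 0:
        return "removed"
    if src == 0:
        return "appeared"
    return "converted"


def connected_regions(before, after):
    """Yield 4-connected groups of changed pixels sharing one class pair."""
    height, width = len(before), len(before[0])
    seen = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if seen[r][c] or before[r][c] == after[r][c]:
                continue
            pair = (before[r][c], after[r][c])
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and (before[ny][nx], after[ny][nx]) == pair):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            yield pair, pixels


def mask_to_quadruples(before, after):
    """Return (where, what, how, how_many) tuples from two label masks."""
    height, width = len(before), len(before[0])
    counts = {}
    for (src, dst), pixels in connected_regions(before, after):
        row = sum(p[0] for p in pixels) / len(pixels)
        col = sum(p[1] for p in pixels) / len(pixels)
        key = (position_word(row, col, height, width), src, dst)
        counts[key] = counts.get(key, 0) + 1
    return [
        (where, (CLASS_NAMES[src], CLASS_NAMES[dst]), transition_word(src, dst), n)
        for (where, src, dst), n in counts.items()
    ]


def describe(quadruple):
    """Fill one fixed template (hypothetical wording) from a quadruple."""
    where, (src, dst), how, how_many = quadruple
    noun = "region" if how_many == 1 else "regions"
    return f"{how_many} {noun} in the {where} changed from {src} to {dst} ({how})."


if __name__ == "__main__":
    before = [[0] * 6 for _ in range(6)]
    after = [[0] * 6 for _ in range(6)]
    before[0][0] = before[0][1] = 1   # a building that disappears
    after[5][4] = after[5][5] = 2     # a road that appears
    for quad in mask_to_quadruples(before, after):
        print(describe(quad))
```

Because every field is read off the label masks, the resulting text needs no extra annotation and carries no model-generated noise, which is the property the abstract emphasizes.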
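Abstract (reference translation): Remote sensing change detection is important for urban monitoring, disaster assessment, and environmental resource management. However, unimodal deep learning methods often confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and therefore noisy. Fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels at zero additional annotation cost. Specifically, each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into several fixed-template text descriptions, providing precise, dense, and noise-free multimodal supervision. A multimodal decoder with a bi-directional contrastive loss is introduced to achieve deep alignment between visual features and structured textual embeddings. To validate the proposed method, we construct Gaza-Change-v2, a new multi-class change detection (MCD) dataset covering the Gaza Strip. On this MCD dataset, S2M achieves a Sek of 17.80% and an F$_{\text{scd}}$ of 66.14%. Our work shows that masks can indeed talk: they tell us exactly what, where, how, and how many changes have occurred.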
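The abstract names a bi-directional contrastive loss for aligning visual features with text embeddings but does not give its form. The sketch below assumes the common symmetric InfoNCE formulation (as used in CLIP-style training), where matched image and text embeddings share a row index; the temperature value of 0.07 and all function names are assumptions, not details from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum - row[i]
    return total / len(logits)


def bidirectional_contrastive_loss(visual, textual, temperature=0.07):
    """Average of the visual-to-text and text-to-visual InfoNCE losses."""
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in textual]
    # Cosine similarity between every visual/text pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print(bidirectional_contrastive_loss(visual, matched_text))
    print(bidirectional_contrastive_loss(visual, swapped_text))
```

The loss is small when each visual feature is most similar to its own text embedding and large when the pairing is wrong, and it is unchanged if the two modalities swap roles, which is what "bi-directional" refers to in this formulation.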


Related papers