Fugu-MT Paper Translation (Summary): Grounding Synthetic Data Generation With Vision and Language Models

Paper summary: Grounding Synthetic Data Generation With Vision and Language Models

arxiv url: http://arxiv.org/abs/2603.09625v1
Date: Tue, 10 Mar 2026 13:03:53 GMT
Status: Translation complete
System update date: 2026-03-11 15:25:24.323945
Title: Grounding Synthetic Data Generation With Vision and Language Models
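Title (reference translation): Grounded synthetic data generation using vision and language models
Authors: Ümit Mert Çağlar, Alptekin Temizel
Abstract summary: This paper proposes a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Based on this framework, it introduces ARAS400k, a large-scale remote sensing dataset augmented with synthetic data for segmentation and captioning. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions.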
Reference score (independently computed attention score): 4.554894288663752
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at zenodo.org/records/18890661 and the code base at github.com/caglarmert/ARAS400k.
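Abstract (reference translation): Deep learning models benefit from greater data diversity and volume, which motivates synthetic data augmentation as a way to improve existing datasets. However, existing evaluation metrics for synthetic data typically compute latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. This paper proposes a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. The approach combines generative models, semantic segmentation, and image captioning with vision and language models. Based on this framework, the authors introduce ARAS400k, a large-scale remote sensing dataset augmented with synthetic data for segmentation and captioning. It contains 100k real images and 300k synthetic images, each paired with a segmentation map and a description. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that models trained only on synthetic data reach competitive performance, while models trained on augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. The work thereby establishes a scalable benchmark for remote sensing tasks, specifically semantic segmentation and image captioning. The dataset is available at zenodo.org/records/18890661 and the code base at github.com/caglarmert/ARAS400k.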
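
The three checks named in the abstract (semantic composition, caption redundancy, cross-modal consistency) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names, the `min_fraction` threshold, the exact-match redundancy measure, and all function names are assumptions made for illustration only. The actual pipeline is in the linked code base.

```python
from collections import Counter

# Hypothetical label map; the abstract does not list the dataset's classes.
CLASS_NAMES = {0: "water", 1: "forest", 2: "built-up", 3: "cropland"}


def semantic_composition(seg_map):
    """Return the fraction of pixels per class name for a 2-D grid of labels."""
    counts = Counter(label for row in seg_map for label in row)
    total = sum(counts.values())
    return {CLASS_NAMES[label]: n / total for label, n in counts.items()}


def caption_is_consistent(caption, composition, min_fraction=0.25):
    """Check that every class covering at least min_fraction is named in the caption.

    Plain substring matching is an assumed stand-in for a real
    vision-language consistency check.
    """
    text = caption.lower()
    dominant = [name for name, frac in composition.items() if frac >= min_fraction]
    return all(name in text for name in dominant)


def redundancy(captions):
    """Fraction of captions that exactly repeat an earlier one after normalisation."""
    normalised = [" ".join(c.lower().split()) for c in captions]
    return 1 - len(set(normalised)) / len(normalised)


# Toy 4x4 segmentation map: 4 water, 9 forest, 2 built-up, 1 cropland pixel.
seg_map = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 1, 1, 1],
    [2, 1, 1, 3],
]
composition = semantic_composition(seg_map)
print(composition)
print(caption_is_consistent("A forest bordering a body of water.", composition))
```

In this toy example, water and forest both meet the assumed 25% threshold, so a caption is counted as consistent only if it mentions both. Because each check produces a plain number or a yes/no answer, the result is easier to interpret than a latent feature similarity score.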

Related papers