Fugu-MT Paper Translation (Abstract): Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets

Paper overview: Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets

arxiv url: http://arxiv.org/abs/2602.09775v1
Date: Tue, 10 Feb 2026 13:36:16 GMT
Status: Translation complete
Last updated in system: 2026-02-11 20:17:43.549241
Title: Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets
Title (reference translation): Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets
Authors: Abhipsa Basu, Yugam Bahl, Kirti Bhagat, Preethi Seshadri, R. Venkatesh Babu, Danish Pruthi
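Abstract summary: The authors geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. The United States, the United Kingdom, and Canada together account for 48.0% of samples, while South American and African countries are represented by only 1.8% and 3.8% of images, respectively.
Reference score (independently computed attention measure): 33.86868726260716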
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across $20$ common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for $48.0\%$ of samples, while South American and African countries are severely under-represented with only $1.8\%$ and $3.8\%$ of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data ($\rho = 0.82$). Examining non-English subsets for $4$ languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
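Abstract (reference translation): Recent studies show that text-to-image models often fail to generate geographically representative images, which raises concerns about how representative their training data is and prompts the question of which parts of the world those training examples come from. The authors geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Examining English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across 20 common entities (e.g., house, flag), they find that the United States, the United Kingdom, and Canada account for 48.0% of samples, while South American and African countries are represented by only 1.8% and 3.8% of images, respectively. They observe a strong correlation between a country's GDP and its representation in the data (ρ = 0.82). Examining non-English subsets for 4 languages from the Re-LAION dataset, they find that representation skews heavily toward countries where those languages are predominantly spoken. They also find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, they show that while the generated images appear realistic, their coverage is severely limited compared to real-world images.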

Related papers