Fugu-MT Paper Translation (Abstract): D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models

Paper overview: D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models

arxiv url: http://arxiv.org/abs/2512.15747v1
Date: Wed, 10 Dec 2025 20:41:29 GMT
Status: Translation complete
Last updated in system: 2025-12-19 18:10:31.653995
Title: D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models
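Authors: Javon Hickmon
Abstract summary: This paper proposes a training-free, zero-shot method that boosts classification accuracy while reducing demographic bias in pre-trained multimodal models. It shows that providing diverse demographic data at inference time improves the performance of these models, and explores the impact of individual demographics on the resulting accuracy metric.
Reference score (independently computed attention score): 4.56877715768796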
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Image classification is a task essential for machine perception to achieve human-level image understanding. Multimodal models such as CLIP have been able to perform well on this task by learning semantic similarities across vision and language; however, despite these advances, image classification is still a challenging task. Models with low capacity often suffer from underfitting and thus underperform on fine-grained image classification. Along with this, it is important to ensure high-quality data with rich cross-modal representations of each class, which is often difficult to generate. When datasets do not enforce balanced demographics, the predictions will be biased toward the more represented class, while others will be neglected. We focus on how these issues can lead to harmful bias for zero-shot image classification, and explore how to combat these issues in demographic bias. We propose Diverse Demographic Data Generation (D3G), a training-free, zero-shot method of boosting classification accuracy while reducing demographic bias in pre-trained multimodal models. With this method, we utilize CLIP as our base multimodal model and Stable Diffusion XL as our generative model. We demonstrate that providing diverse demographic data at inference time improves performance for these models, and explore the impact of individual demographics on the resulting accuracy metric.
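The abstract does not spell out how the generated data enters the classifier, so the following is only a minimal sketch of one plausible reading: each class gets a prototype that blends its text embedding with the mean embedding of images generated for several demographic variants of that class, and a query image is assigned to the nearest prototype by cosine similarity. The three-dimensional vectors are hand-written stand-ins for CLIP embeddings, and the function names, the `text_weight` parameter, and the class labels are all hypothetical rather than taken from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def class_prototype(text_emb, generated_image_embs, text_weight=0.5):
    """Blend a class text embedding with the mean embedding of generated images.

    text_weight is a hypothetical mixing coefficient, not a value from the paper.
    """
    text_unit = normalize(text_emb)
    image_unit = normalize(mean_vector([normalize(e) for e in generated_image_embs]))
    blended = [
        text_weight * t + (1.0 - text_weight) * i
        for t, i in zip(text_unit, image_unit)
    ]
    return normalize(blended)


def classify(query_emb, prototypes):
    """Return the label whose prototype is most cosine-similar to the query."""
    query_unit = normalize(query_emb)
    scores = {
        label: sum(q * p for q, p in zip(query_unit, proto))
        for label, proto in prototypes.items()
    }
    return max(scores, key=scores.get), scores


# Toy stand-ins for embeddings: one text vector per class, plus one image
# vector per generated demographic variant of that class (all made up).
prototypes = {
    "doctor": class_prototype([1.0, 0.0, 0.0], [[0.9, 0.1, 0.3], [0.8, 0.0, 0.5]]),
    "nurse": class_prototype([0.0, 1.0, 0.0], [[0.1, 0.9, 0.3], [0.0, 0.8, 0.5]]),
}

query = [0.7, 0.2, 0.6]  # toy embedding of the image to classify
predicted_label, similarity_scores = classify(query, prototypes)
print(predicted_label)
```

In a real pipeline the toy vectors would be replaced by embeddings from the multimodal model, and the per-class image embeddings would come from images produced by the generative model; how the paper actually weights or combines them is not stated in the abstract.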
