Fugu-MT Paper Translation (Abstract): D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding

Paper abstract: D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding

arxiv url: http://arxiv.org/abs/2505.24372v1
Date: Fri, 30 May 2025 09:04:47 GMT
Status: Translation complete
Last updated in system: 2025-06-02 19:47:52.867333
Title: D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding
Authors: Yichi Zhang, Gongwei Chen, Jun Zhu, Jia Wan,
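Abstract summary: D2AF is a robust annotation framework for visual grounding that uses only input images. By implementing dual-driven annotation strategies, it effectively generates detailed region-text pairs. The findings indicate that increasing data volume improves model performance.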
Reference score (independently calculated attention level): 36.321156992727055
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual Grounding is a task that aims to localize a target region in an image based on a free-form natural language description. With the rise of Transformer architectures, there is an increasing need for larger datasets to boost performance. However, the high cost of manual annotation poses a challenge, hindering the scale of data and the ability of large models to enhance their effectiveness. Previous pseudo label generation methods heavily rely on human-labeled captions of the original dataset, limiting scalability and diversity. To address this, we propose D2AF, a robust annotation framework for visual grounding using only input images. This approach overcomes dataset size limitations and enriches both the quantity and diversity of referring expressions. Our approach leverages multimodal large models and object detection models. By implementing dual-driven annotation strategies, we effectively generate detailed region-text pairs using both closed-set and open-set approaches. We further conduct an in-depth analysis of data quantity and data distribution. Our findings demonstrate that increasing data volume enhances model performance. However, the degree of improvement depends on how well the pseudo labels broaden the original data distribution. Based on these insights, we propose a consistency and distribution aware filtering method to further improve data quality by effectively removing erroneous and redundant data. This approach effectively eliminates noisy data, leading to improved performance. Experiments on three visual grounding tasks demonstrate that our method significantly improves the performance of existing models and achieves state-of-the-art results.


Related papers