Fugu-MT paper translation (abstract): JourneyDB: A Benchmark for Generative Image Understanding

Paper overview: JourneyDB: A Benchmark for Generative Image Understanding

arxiv url: http://arxiv.org/abs/2307.00716v1
Date: Mon, 3 Jul 2023 02:39:08 GMT
Status: translation complete
Last updated in system: 2023-07-05 14:39:19.628832
Title: JourneyDB: A Benchmark for Generative Image Understanding
Title (reference translation): JourneyDB: A Benchmark for Generative Image Understanding
Authors: Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
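Abstract summary: We present JourneyDB, a large-scale dataset for multi-modal visual understanding in generative images. Our curated dataset pairs 4 million diverse, high-quality generated images with the text prompts used to produce them.
Reference score (independently computed attention score): 59.30706906338838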
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While recent advancements in vision-language models have revolutionized multi-modal understanding, it remains unclear whether they possess the capabilities of comprehending the generated images. Compared to real data, synthetic images exhibit a higher degree of diversity in both content and style, for which there are significant difficulties for the models to fully apprehend. To this end, we present a large-scale dataset, JourneyDB, for multi-modal visual understanding in generative images. Our curated dataset covers 4 million diverse and high-quality generated images paired with the text prompts used to produce them. We further design 4 benchmarks to quantify the performance of generated image understanding in terms of both content and style interpretation. These benchmarks include prompt inversion, style retrieval, image captioning and visual question answering. Lastly, we assess the performance of current state-of-the-art multi-modal models when applied to JourneyDB, and provide an in-depth analysis of their strengths and limitations in generated content understanding. We hope the proposed dataset and benchmarks will facilitate the research in the field of generative content understanding. The dataset will be available on https://journeydb.github.io.
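Abstract (reference translation): Recent advances in vision-language models have revolutionized multi-modal understanding, but it remains unclear whether they can comprehend generated images. Compared with real data, synthetic images show greater diversity in both content and style, which makes them significantly harder for models to fully understand. We therefore present JourneyDB, a large-scale dataset for multi-modal visual understanding in generative images. Our curated dataset covers 4 million diverse, high-quality generated images paired with the text prompts used to produce them. We further design four benchmarks to quantify generated image understanding in terms of both content and style interpretation. These benchmarks comprise prompt inversion, style retrieval, image captioning, and visual question answering. Finally, we assess the performance of current state-of-the-art multi-modal models when applied to JourneyDB and analyze in depth their strengths and limitations in generated content understanding. We hope the proposed dataset and benchmarks will facilitate research on generative content understanding. The dataset will be available at https://journeydb.github.io.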

Related papers

Entity Image and Mixed-Modal Image Retrieval Datasets [9.6977953463099]
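This paper proposes new benchmarks for rigorously evaluating image retrieval. We present two new datasets: the Entity Image dataset (EI), which provides canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval dataset (MMIR), derived from the WIT dataset. We empirically validate the utility of the benchmarks as a training corpus and as an evaluation set for mixed-modal retrieval.
Paper reference translation (metadata) (2025-06-02T22:04:06Z)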
Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment [53.45813302866466]
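We propose ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG evaluates responses at four levels: holistic, structural, block-level, and image-specific. Together with ISG, we introduce a benchmark, ISG-Bench, covering 1,150 samples across 8 categories and 21 subcategories.
Paper reference translation (metadata) (2024-11-26T07:55:57Z)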
Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
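This study examines whether current AI models can perform knowledge-aware reasoning over multimodal structured data. We introduce MMTabQA, a new dataset designed for this purpose. Our experiments highlight significant challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
Paper reference translation (metadata) (2024-08-25T15:17:43Z)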
ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
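Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method that manipulates visual attributes through knowledge guidance. ARMADA is a novel multimodal data generation framework that (i) extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs. It also highlights the need to leverage external knowledge proxies for improved interpretability and real-world grounding.
Paper reference translation (metadata) (2024-08-19T15:27:25Z)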
TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
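TRINS is a Text-Rich image INStruction dataset. It contains 39,153 images, captions, and 102,437 questions. This paper proposes the Language-vision Reading Assistant (LaRA), which is adept at understanding textual content within images.
Paper reference translation (metadata) (2024-06-10T18:52:37Z)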
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
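We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese. We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations. PhoVIT is a comprehensive multimodal fusion model that identifies objects in images based on questions.
Paper reference translation (metadata) (2023-10-27T10:44:50Z)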
EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
Entity-Driven Image Search (EDIS) is a dataset for cross-modal image retrieval in the news domain. EDIS consists of 1 million web images from real search engine results and curated datasets, with each image paired with a textual description.
Paper reference translation (metadata) (2023-05-23T02:59:19Z)

The related papers list is generated automatically from the titles and abstracts of papers on this site.

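Information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).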