Fugu-MT Paper Translation (Abstract): WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

Paper overview: WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

arxiv url: http://arxiv.org/abs/2511.11434v1
Date: Fri, 14 Nov 2025 16:02:38 GMT
Status: Translation complete
System update date: 2025-11-17 22:42:18.70549
Title: WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Title (reference translation): WEAVE: Unleashing and Benchmarking In-Context Interleaved Comprehension and Generation
Authors: Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua
Abstract summary: We present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images.
Reference score (independently computed attention score): 98.47375190901447
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it facilitates UMMs to develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for multi-modal community.
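Abstract (reference translation): Recent advances in unified multimodal models (UMMs) have enabled remarkable progress in visual comprehension and generation. However, existing datasets and benchmarks focus mainly on single-turn interactions and fail to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, this paper introduces WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. The suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images. WEAVEBench is a human-annotated benchmark of 100 tasks based on 480 images. It features a hybrid VLM judger evaluation framework, based on both the reference image and the combination of the original image with the editing instructions, that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments show that training on WEAVE-100k enables visual comprehension, image editing, and comprehension-generation collaboration capabilities. It also helps UMMs develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench reveal the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multi-modal community.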

Related papers