Fugu-MT paper translation (abstract): ProductWebGen: Benchmarking Multimodal Product Webpage Generation

Paper abstract: ProductWebGen: Benchmarking Multimodal Product Webpage Generation

arxiv url: http://arxiv.org/abs/2606.01022v1
Date: Sun, 31 May 2026 05:25:14 GMT
Status: translation complete
System update date: 2026-06-02 21:34:29.081056
Title: ProductWebGen: Benchmarking Multimodal Product Webpage Generation
Authors: Zhihong Liu, Siqi Kou, Zheng Li, Ye Ma, Quan Chen, Peng Jiang, Kai Yu, Zhijie Deng
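Abstract summary: This paper introduces ProductWebGen to benchmark the product webpage generation capabilities of advanced multimodal generative models. ProductWebGen comprises 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage containing multiple consistent images in accordance with the source image and instructions.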
Reference score (attention metric computed independently by the site): 38.39574522096441
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and e-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models (UMs). To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two workflows for evaluation: one uses large language models and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine-tuning dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The data and code are available at https://github.com/SJTU-DENG-Lab/ProductWebGen.
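As an illustration only, the editing-based workflow described in the abstract might be sketched as follows. Nothing here comes from the ProductWebGen repository: the `Sample` field names, the `GEN_<n>.png` placeholder convention, and the `generate_html` / `edit_image` callables are all hypothetical stand-ins for an LLM and an image editing model, and deriving every image from the same source image is an assumed way of keeping the product consistent.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Sample:
    """One benchmark sample with the three components named in the abstract.
    Field names are hypothetical."""
    source_image: str                # path to the source product image
    visual_content_instruction: str
    webpage_instruction: str


# Assumed convention: the HTML generator marks each image to be produced as
# <img src="GEN_<n>.png" alt="<editing prompt>">.
IMG_TAG = re.compile(r'<img\s+src="(GEN_\d+\.png)"\s+alt="([^"]*)"')


def editing_based(
    sample: Sample,
    generate_html: Callable[[str, str], str],
    edit_image: Callable[[str, str], str],
) -> Tuple[str, Dict[str, str]]:
    """Generate HTML first, then fill each image slot by editing the source image."""
    html = generate_html(sample.webpage_instruction,
                         sample.visual_content_instruction)
    images = {}
    for name, prompt in IMG_TAG.findall(html):
        # Every image starts from the same source image (an assumption).
        images[name] = edit_image(sample.source_image, prompt)
    return html, images


# Stub models so the sketch runs without any real LLM or editing model.
def fake_llm(webpage_instruction: str, visual_instruction: str) -> str:
    return ('<html><body><h1>Mug</h1>'
            '<img src="GEN_0.png" alt="mug on a desk">'
            '<img src="GEN_1.png" alt="mug held in hand">'
            '</body></html>')


def fake_editor(source_image: str, prompt: str) -> str:
    return f"edited({source_image}, {prompt})"


sample = Sample("mug.png",
                "Show the mug in two daily scenes.",
                "Single-column page with a title.")
html, images = editing_based(sample, fake_llm, fake_editor)
```

In the UM-based workflow, by contrast, a single unified model would play both roles, with each image conditioned on the preceding multimodal context; that variant is not sketched here.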
