A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding
- URL: http://arxiv.org/abs/2305.03668v2
- Date: Fri, 20 Oct 2023 13:18:06 GMT
- Title: A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding
- Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo
- Abstract summary: We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
- Score: 66.6468787004067
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Webpages have been a rich, scalable resource for vision-language and language
only tasks. Yet only pieces of webpages are kept in existing datasets:
image-caption pairs, long text articles, or raw HTML, never all in one place.
As a result, webpage tasks have received little attention and structured
image-text data has been left underused. To study multimodal webpage understanding, we
introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all
of the associated image, text, and structure data. We verify its utility on
three generative tasks: page description generation, section summarization, and
contextual image captioning. We design a novel attention mechanism, Prefix
Global, which selects the most relevant image and text content as global tokens
to attend to the rest of the webpage for context. By using page structure to
separate such tokens, it performs better than full attention with lower
computational complexity. Extensive experiments show that the new data in
WikiWeb2M improves task performance compared to prior work.
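The abstract's description of Prefix Global is concrete enough to sketch. Below is a minimal, hypothetical Python sketch of how such an attention pattern could be assembled from page structure: a selected section's text and image tokens form a global prefix that attends everywhere, while the remaining page tokens attend only within a local window. The names (PageSection, build_prefix_global_mask), the rule that the target section supplies the global prefix, and the window radius are illustrative assumptions, not the paper's released implementation; the PageSection record only loosely mirrors the section-level text, image, and structure data the dataset retains.

```python
# Hypothetical sketch of a Prefix Global-style attention mask built from
# webpage structure. All names and the token-selection rule are assumptions
# for illustration, not the paper's code.
from dataclasses import dataclass, field
from typing import List
import numpy as np


@dataclass
class PageSection:
    """One webpage section: its text tokens and any image tokens it contains."""
    text_tokens: List[int]
    image_tokens: List[int] = field(default_factory=list)


def build_prefix_global_mask(sections: List[PageSection],
                             target_index: int,
                             local_radius: int = 4) -> np.ndarray:
    """Place the target section's text and image tokens first as a global
    prefix; the rest of the page follows as context.

    Prefix (global) tokens attend to, and are attended by, every position;
    context tokens only attend within a sliding window, so cost grows
    roughly linearly in page length instead of quadratically.
    """
    # Assumed selection rule: the section being summarized or captioned
    # supplies the global prefix; everything else on the page is context.
    prefix = sections[target_index].text_tokens + sections[target_index].image_tokens
    context = [t for i, s in enumerate(sections) if i != target_index
               for t in s.text_tokens + s.image_tokens]
    n_prefix, n = len(prefix), len(prefix) + len(context)

    mask = np.zeros((n, n), dtype=bool)
    mask[:n_prefix, :] = True          # prefix tokens attend everywhere
    mask[:, :n_prefix] = True          # every token attends to the prefix
    for q in range(n_prefix, n):       # context tokens: local window only
        lo = max(n_prefix, q - local_radius)
        hi = min(n, q + local_radius + 1)
        mask[q, lo:hi] = True
    return mask


# Example: a 3-section page; captioning an image in section 1.
page = [PageSection(list(range(6))),
        PageSection(list(range(6, 10)), image_tokens=[100, 101]),
        PageSection(list(range(10, 18)))]
mask = build_prefix_global_mask(page, target_index=1)
print(mask.shape, int(mask.sum()), "of", mask.size, "query-key pairs kept")
```

Under these assumptions the attention cost scales roughly with n * (prefix size + window size) rather than n^2 for full attention, which is consistent with the abstract's claim of lower computational complexity.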
Related papers
- Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs)
These instructions are then paired with UI screenshots to train multimodal models.
We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z) - mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
arXiv Detail & Related papers (2024-09-05T11:09:00Z) - Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z) - Dual-View Visual Contextualization for Web Navigation [36.41910428196889]
We propose to contextualize HTML elements through their "dual views" in webpage screenshots.
We build upon the insight that web developers tend to arrange task-related elements nearby on webpages to enhance user experience.
The resulting representations of HTML elements are more informative for the agent to take action.
arXiv Detail & Related papers (2024-02-06T23:52:10Z) - WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset [48.00110675968677]
We introduce the Wikipedia Webpage 2M (WikiWeb2M) suite; the first to retain the full set of images, text, and structure data available in a page.
WikiWeb2M can be used for tasks like page description generation, section summarization, and contextual image captioning.
arXiv Detail & Related papers (2023-05-09T13:20:59Z) - WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z) - DOM-LM: Learning Generalizable Representations for HTML Documents [33.742833774918786]
We introduce a novel representation learning approach for web pages, dubbed DOM-LM, which addresses the limitations of existing approaches.
We evaluate DOM-LM on a variety of webpage understanding tasks, including Attribute Extraction, Open Information Extraction, and Question Answering.
arXiv Detail & Related papers (2022-01-25T20:10:32Z) - CM3: A Causal Masked Multimodal Model of the Internet [86.32652030161374]
We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents.
We train causally masked language-image models on large-scale web and Wikipedia articles.
CM3 models can generate rich structured, multi-modal outputs while conditioning on arbitrary masked document contexts.
arXiv Detail & Related papers (2022-01-19T10:45:38Z) - FreeDOM: A Transferable Neural Architecture for Structured Information
Extraction on Web Documents [16.101638575566444]
FreeDOM's first stage learns a representation for each DOM node in the page by combining both the text and markup information.
The second stage captures longer-range distance and semantic relatedness using a relational neural network.
arXiv Detail & Related papers (2020-10-21T04:20:13Z)