WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset
- URL: http://arxiv.org/abs/2305.05432v1
- Date: Tue, 9 May 2023 13:20:59 GMT
- Title: WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset
- Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan
A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo
- Abstract summary: We introduce the Wikipedia Webpage 2M (WikiWeb2M) suite, the first to retain the full set of images, text, and structure data available in a page.
WikiWeb2M can be used for tasks like page description generation, section summarization, and contextual image captioning.
- Score: 48.00110675968677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Webpages have been a rich resource for language and vision-language tasks.
Yet only pieces of webpages are kept: image-caption pairs, long text articles,
or raw HTML, never all in one place. As a result, webpage tasks have received
little attention and structured image-text data has been left underused. To study multimodal
webpage understanding, we introduce the Wikipedia Webpage 2M (WikiWeb2M) suite,
the first to retain the full set of images, text, and structure data available
in a page. WikiWeb2M can be used for tasks like page description generation,
section summarization, and contextual image captioning.
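All three tasks operate over the same page-level record, so a single structured example can serve multiple objectives. Below is a minimal Python sketch of how such a record and its task views might be organized; the class and field names (WikiPage, Section, page_description, and so on) are illustrative assumptions, not the released WikiWeb2M schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative page-level record; names are assumptions, not the released schema.
@dataclass
class Section:
    title: str
    text: str
    image_urls: List[str] = field(default_factory=list)
    image_captions: List[Optional[str]] = field(default_factory=list)

@dataclass
class WikiPage:
    url: str
    page_title: str
    page_description: str      # target for page description generation
    sections: List[Section]    # ordered, preserving page structure

def task_examples(page: WikiPage):
    """Derive one example per task from a single page record."""
    # 1) Page description generation: full page -> short description.
    yield ("page_description", page, page.page_description)
    for i, sec in enumerate(page.sections):
        # 2) Section summarization: page as context, a section summary
        #    (e.g., the section's leading sentence) assumed as target.
        yield ("section_summarization", (page, i), sec.text.split(". ")[0])
        # 3) Contextual image captioning: page context + image -> caption.
        for url, cap in zip(sec.image_urls, sec.image_captions):
            if cap:
                yield ("image_captioning", (page, i, url), cap)
```

The point of the sketch is that keeping images, text, and structure together lets one record feed all three generative tasks without re-collecting data.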
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
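The adaptive encoding idea is essentially a budgeting problem: a fixed visual sequence length must be divided across several text-rich images. The sketch below shows one plausible proportional-allocation scheme, assuming a total token budget and per-image resolutions; the helper allocate_visual_tokens is hypothetical and not Leopard's published module.

```python
from typing import List, Tuple

def allocate_visual_tokens(resolutions: List[Tuple[int, int]],
                           total_budget: int,
                           min_per_image: int = 64) -> List[int]:
    """Split a fixed visual-token budget across images in proportion to their
    pixel area. Illustrative assumption only; not Leopard's actual module."""
    areas = [w * h for w, h in resolutions]
    total_area = sum(areas)
    # Proportional share, with a floor so small images are not starved.
    raw = [max(min_per_image, int(total_budget * a / total_area)) for a in areas]
    # Rescale if the floors pushed the total over budget.
    scale = min(1.0, total_budget / sum(raw))
    return [max(1, int(r * scale)) for r in raw]

# Example: three screenshots of different sizes sharing a 2048-token budget.
print(allocate_visual_tokens([(1920, 1080), (800, 600), (640, 480)], 2048))
```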
- Dual-View Visual Contextualization for Web Navigation [36.41910428196889]
We propose to contextualize HTML elements through their "dual views" in webpage screenshots.
We build on the insight that web developers tend to arrange task-related elements near one another on webpages to enhance user experience.
The resulting representations of HTML elements are more informative for the agent to take action.
arXiv Detail & Related papers (2024-02-06T23:52:10Z)
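The "dual view" pairing can be pictured as attaching, to each HTML element, the screenshot patch it renders to, plus the patches of its visual neighbors. Below is a minimal sketch under assumed inputs (element bounding boxes and a page screenshot); the Element layout and neighbor rule are hypothetical illustrations, not the paper's pipeline.

```python
from dataclasses import dataclass
from typing import List
from PIL import Image

@dataclass
class Element:
    html: str     # raw HTML snippet of the element
    box: tuple    # (left, top, right, bottom) in screenshot pixels

def dual_views(screenshot: Image.Image, elements: List[Element],
               neighbor_radius: int = 100):
    """Pair each HTML element with (a) its own screenshot crop and (b) crops of
    nearby elements, following the intuition that task-related elements sit
    close together on the page. Data layout is an assumption for illustration."""
    def center(b):
        return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    views = []
    for el in elements:
        cx, cy = center(el.box)
        neighbors = [
            screenshot.crop(other.box)
            for other in elements
            if other is not el
            and abs(center(other.box)[0] - cx) <= neighbor_radius
            and abs(center(other.box)[1] - cy) <= neighbor_radius
        ]
        views.append((el.html, screenshot.crop(el.box), neighbors))
    return views
```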
- A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context.
arXiv Detail & Related papers (2023-05-05T16:38:05Z)
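Read this way, Prefix Global is a sparse attention pattern: a chosen prefix of salient image and text tokens gets full (global) attention, while the remaining webpage tokens attend only within a local window and to that prefix. The NumPy sketch below builds such a mask under that reading; the exact token selection and pattern in the paper may differ.

```python
import numpy as np

def prefix_global_mask(seq_len: int, prefix_len: int, local_radius: int) -> np.ndarray:
    """Boolean attention mask where mask[i, j] = True means token i may attend
    to token j. The first `prefix_len` tokens (the selected global content)
    attend everywhere and are attended by every token; all other tokens attend
    only within a local window and to the prefix. Illustrative reading of
    Prefix Global, not the paper's exact implementation."""
    idx = np.arange(seq_len)
    local = np.abs(idx[:, None] - idx[None, :]) <= local_radius
    global_rows = idx[:, None] < prefix_len   # prefix tokens attend to all
    global_cols = idx[None, :] < prefix_len   # everyone attends to the prefix
    return local | global_rows | global_cols

# Example: 12 tokens, 3 global prefix tokens, local window of radius 2.
print(prefix_global_mask(12, 3, 2).astype(int))
```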
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
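Concretely, each statement can be carried as a (subject, predicate, object, qualifier) quadruple, or a triple when no qualifier is present, and then turned into a sentence. The sketch below uses a hand-written template as a stand-in for that mapping step; the Statement type, template, and example are illustrative, not the paper's corpus or alignment procedure.

```python
from typing import NamedTuple, Optional

class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str
    qualifier: Optional[str] = None   # e.g. a point-in-time qualifier

def verbalize(st: Statement) -> str:
    """Template-based stand-in for the statement-to-sentence mapping.
    WS2T output is ultimately aligned against real English Wikipedia sentences."""
    sentence = f"{st.subject} {st.predicate} {st.obj}"
    if st.qualifier:
        sentence += f" {st.qualifier}"
    return sentence + "."

# Illustrative quadruple (not taken from the paper's data).
print(verbalize(Statement("Marie Curie", "was awarded",
                          "the Nobel Prize in Physics", "in 1903")))
```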
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
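Zero-shot image-set retrieval scores an article against a set of candidate images rather than a single image. One plausible aggregation, sketched below over pre-computed embeddings, is to average the article's cosine similarity across the images in each set; this is an assumed baseline shape for illustration, not necessarily the paper's method.

```python
import numpy as np

def score_image_sets(article_emb: np.ndarray, image_sets: list) -> np.ndarray:
    """Rank candidate image sets for one article by mean cosine similarity.
    `article_emb` is a (d,) vector; each entry of `image_sets` is an (n_i, d)
    array of image embeddings. The aggregation choice is an assumption."""
    a = article_emb / np.linalg.norm(article_emb)
    scores = []
    for imgs in image_sets:
        imgs = imgs / np.linalg.norm(imgs, axis=1, keepdims=True)
        scores.append(float(np.mean(imgs @ a)))
    return np.array(scores)

# Toy example: one article embedding against two candidate image sets.
rng = np.random.default_rng(0)
article = rng.normal(size=512)
sets = [rng.normal(size=(3, 512)), rng.normal(size=(5, 512))]
print(score_image_sets(article, sets).argmax())
```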
- Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching [9.56339585008373]
We present the system we designed for participating in the Wikipedia Image-Caption Matching challenge on Kaggle.
Our approach achieves remarkable results, obtaining a normalized Discounted Cumulative Gain (nDCG) value of 0.53 on the private leaderboard of the Kaggle challenge.
arXiv Detail & Related papers (2022-06-21T14:30:14Z)
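For reference, normalized Discounted Cumulative Gain compares the discounted gain of a predicted ranking against the ideal ordering of the same relevance labels. A small self-contained computation follows; the relevance values are made up for illustration, and the challenge's exact label convention may differ.

```python
import numpy as np

def dcg(relevances: np.ndarray) -> float:
    """DCG = sum_i rel_i / log2(i + 1), with ranks i starting at 1."""
    ranks = np.arange(1, len(relevances) + 1)
    return float(np.sum(relevances / np.log2(ranks + 1)))

def ndcg(relevances_in_predicted_order: np.ndarray) -> float:
    """nDCG = DCG of the predicted ranking / DCG of the ideal ranking."""
    ideal = np.sort(relevances_in_predicted_order)[::-1]
    denom = dcg(ideal)
    return dcg(relevances_in_predicted_order) / denom if denom > 0 else 0.0

# Toy example with made-up binary relevance labels in predicted order.
print(round(ndcg(np.array([1.0, 0.0, 1.0, 0.0, 0.0])), 3))  # about 0.92
```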
- FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents [16.101638575566444]
FreeDOM is a two-stage architecture for structured information extraction from web documents.
The first stage learns a representation for each DOM node in the page by combining both the text and markup information.
The second stage captures longer-range distance and semantic relatedness using a relational neural network.
arXiv Detail & Related papers (2020-10-21T04:20:13Z)
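The two stages can be pictured as (1) a per-node encoder over text plus markup features and (2) a pairwise module over node pairs that looks at distance and semantic relatedness. The NumPy sketch below mimics that data flow with stand-in features; it is a shape-level illustration, not FreeDOM's trained networks.

```python
import numpy as np

def encode_node(text_emb: np.ndarray, markup_emb: np.ndarray) -> np.ndarray:
    """Stage 1 stand-in: a node representation combining the node's text
    embedding with an embedding of its markup (tag/attribute) context."""
    return np.concatenate([text_emb, markup_emb])

def pair_features(node_a: np.ndarray, node_b: np.ndarray,
                  pos_a: int, pos_b: int) -> np.ndarray:
    """Stage 2 stand-in: features for a relational module over node pairs,
    capturing longer-range distance and semantic relatedness."""
    distance = np.array([abs(pos_a - pos_b)], dtype=float)   # DOM-order distance
    relatedness = np.array([float(node_a @ node_b)])         # dot-product proxy
    return np.concatenate([node_a, node_b, distance, relatedness])

# Toy usage with random stand-in embeddings for two DOM nodes.
rng = np.random.default_rng(1)
n1 = encode_node(rng.normal(size=16), rng.normal(size=8))
n2 = encode_node(rng.normal(size=16), rng.normal(size=8))
print(pair_features(n1, n2, pos_a=3, pos_b=17).shape)
```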
- WikiHist.html: English Wikipedia's Full Revision History in HTML Format [12.86558129722198]
We develop a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki.
We highlight the advantages of WikiHist.html over raw wikitext in an empirical analysis of Wikipedia's hyperlinks.
arXiv Detail & Related papers (2020-01-28T10:44:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.