Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
- URL: http://arxiv.org/abs/2511.07080v1
- Date: Mon, 10 Nov 2025 13:10:31 GMT
- Title: Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
- Authors: Khalil Hennara, Ahmad Bastati, Muhammad Hreden, Mohamed Motasim Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
- Abstract summary: We present our pipeline Wasm for processing the Common Crawl dataset to create a new Arabic multimodal dataset. Unlike existing Arabic corpora that focus solely on text extraction, our approach preserves the structural integrity of web content. We provide a comprehensive comparative analysis of our data processing pipeline against those used for major existing datasets.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The performance of large language models (LLMs) and large multimodal models (LMMs) depends heavily on the quality and scale of their pre-training datasets. Recent research shows that large multimodal models trained on natural documents where images and text are interleaved outperform those trained only on image-text pairs across a wide range of benchmarks, leveraging advanced pre-trained models to enforce semantic alignment, image-sequence consistency, and textual coherence. For Arabic, however, the lack of high-quality multimodal datasets that preserve document structure has limited progress. In this paper, we present our pipeline Wasm for processing the Common Crawl dataset to create a new Arabic multimodal dataset that uniquely provides markdown output. Unlike existing Arabic corpora that focus solely on text extraction, our approach preserves the structural integrity of web content while maintaining flexibility for both text-only and multimodal pre-training scenarios. We provide a comprehensive comparative analysis of our data processing pipeline against those used for major existing datasets, highlighting the convergences in filtering strategies and justifying our specific design choices. To support future research, we publicly release a representative dataset dump along with the multimodal processing pipeline for Arabic.
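The abstract describes the pipeline's goal rather than its code, but the central step, converting crawled HTML into markdown that keeps images interleaved with the text and then gating on language, can be illustrated with a minimal sketch. Everything below (the element coverage, the `arabic_ratio` gate, the 0.5 threshold) is an illustrative assumption, not the authors' released implementation.

```python
# Minimal sketch of HTML-to-markdown conversion that preserves interleaved
# images, in the spirit of the Wasm pipeline described above. NOT the
# authors' code; element coverage and thresholds are assumptions.
import re
from html.parser import HTMLParser

ARABIC_RE = re.compile(r"[\u0600-\u06FF]")

def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Arabic Unicode block."""
    chars = [c for c in text if not c.isspace()]
    return sum(bool(ARABIC_RE.match(c)) for c in chars) / max(len(chars), 1)

class MarkdownExtractor(HTMLParser):
    """Convert a small HTML subset to markdown, keeping images inline."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):      # keep images interleaved
            self.parts.append(f"\n![{attrs.get('alt') or ''}]({attrs['src']})\n")
        elif tag in ("h1", "h2", "h3"):            # map heading level to #'s
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag in ("p", "br", "li"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def html_to_markdown(html: str, min_arabic: float = 0.5) -> str | None:
    """Return markdown for pages that are mostly Arabic, else None."""
    parser = MarkdownExtractor()
    parser.feed(html)
    md = re.sub(r"\n{3,}", "\n\n", "".join(parser.parts)).strip()
    return md if arabic_ratio(md) >= min_arabic else None

page = '<h1>مرحبا</h1><p>نص عربي مع صورة</p><img src="a.png" alt="صورة">'
print(html_to_markdown(page))
```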
Related papers
- MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts [12.42628977620548]
MoST (Mixture of Speech and Text) is a novel large language model that seamlessly integrates speech and text processing. We introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MoST consistently outperforms existing models of comparable parameter counts.
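The modality-aware routing this summary describes can be approximated with a hard router that sends each token to a per-modality feed-forward expert. The module layout and the hard 0/1 routing below are assumptions for illustration, not MoST's actual architecture.

```python
# Sketch of modality-aware expert routing: each token goes to the expert
# matching its modality tag (0 = text, 1 = speech). Illustrative only.
import torch
import torch.nn as nn

class ModalityRoutedFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        # One feed-forward expert per modality.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(2)
        )

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); modality: (batch, seq) with values {0, 1}.
        out = torch.empty_like(x)
        for idx, expert in enumerate(self.experts):
            mask = modality == idx            # tokens owned by this expert
            if mask.any():
                out[mask] = expert(x[mask])   # route only those tokens
        return out

tokens = torch.randn(2, 8, 512)
modality = torch.randint(0, 2, (2, 8))        # per-token modality tag
print(ModalityRoutedFFN()(tokens, modality).shape)  # torch.Size([2, 8, 512])
```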
arXiv Detail & Related papers (2026-01-15T10:43:29Z)
- HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models [25.953042884928006]
We present an initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. We train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models.
arXiv Detail & Related papers (2025-11-02T20:16:38Z)
- Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation [47.714317480436215]
PREMIR is a simple framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings.
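A toy rendition of the preQ idea follows, with a stub standing in for the MLLM and TF-IDF standing in for the retriever; both stand-ins, and the `mllm_generate` name, are assumptions rather than PREMIR's actual components.

```python
# Sketch: generate pre-questions per document offline, then match incoming
# queries against the preQs instead of the raw documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mllm_generate(doc: str) -> list[str]:
    """Hypothetical stand-in for an MLLM that writes preQs for a document."""
    return [f"What does the document say about {w}?" for w in doc.split()[:3]]

docs = ["interleaved image text pretraining", "arabic common crawl pipeline"]
preqs = [" ".join(mllm_generate(d)) for d in docs]  # built before retrieval

vec = TfidfVectorizer().fit(preqs + docs)

def retrieve(query: str) -> str:
    sims = cosine_similarity(vec.transform([query]), vec.transform(preqs))
    return docs[sims.argmax()]                      # best preQ -> its document

print(retrieve("how is the arabic pipeline built?"))
```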
arXiv Detail & Related papers (2025-08-23T16:14:41Z)
- FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language [48.79534869177174]
We introduce a new pre-training dataset curation pipeline based on FineWeb. We show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset.
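The "one pipeline, adapted per language" idea can be sketched as a shared filter chain whose thresholds and stopword lists are resolved per language. The concrete values below are illustrative assumptions, not FineWeb2's released settings.

```python
# Sketch: the same keep/drop logic runs for every language, but each
# language supplies its own stopwords and thresholds. Values are made up.
from dataclasses import dataclass

@dataclass
class LangConfig:
    stopwords: set[str]        # crude fluency signal
    min_stopword_ratio: float  # threshold adapted per language
    min_words: int

CONFIGS = {
    "en": LangConfig({"the", "and", "of", "to"}, 0.06, 50),
    "ar": LangConfig({"في", "من", "على", "أن"}, 0.04, 30),
}

def keep_document(text: str, lang: str) -> bool:
    cfg = CONFIGS[lang]
    words = text.split()
    if len(words) < cfg.min_words:
        return False
    ratio = sum(w in cfg.stopwords for w in words) / len(words)
    return ratio >= cfg.min_stopword_ratio

print(keep_document("في البداية " * 20, "ar"))  # True: long enough, fluent enough
```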
arXiv Detail & Related papers (2025-06-26T01:01:47Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
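The token-compression idea in the MME summary can be sketched as pooling the same visual sequence down to several nested granularities, trading tokens for accuracy at retrieval time. The pool sizes below are illustrative assumptions.

```python
# Sketch of Matryoshka-style visual token compression via mean pooling.
import torch
import torch.nn.functional as F

def compress_visual_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    # tokens: (batch, n_tokens, dim) -> (batch, keep, dim) by average pooling.
    return F.adaptive_avg_pool1d(tokens.transpose(1, 2), keep).transpose(1, 2)

visual = torch.randn(4, 256, 768)          # e.g. 256 patch tokens per image
for keep in (256, 64, 16, 4):              # nested granularities
    print(keep, compress_visual_tokens(visual, keep).shape)
```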
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images [5.753626355995653]
jina-clip-v2 is a contrastive vision-language model trained on text pairs, triplets, and image-text pairs. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models.
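For reference, a minimal version of the symmetric contrastive (CLIP-style) objective that such models are trained with on image-text pairs; the temperature and embedding sizes are assumptions, not jina-clip-v2's configuration.

```python
# Sketch of the symmetric image-text contrastive loss.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)                 # unit-norm embeddings
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(len(img))                   # matching pairs on diagonal
    return (F.cross_entropy(logits, targets) +         # image -> text
            F.cross_entropy(logits.t(), targets)) / 2  # text -> image

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```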
arXiv Detail & Related papers (2024-12-11T22:28:12Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering both fundamental and capability-specialized datasets. P-MMEval delivers consistent language coverage across datasets and provides parallel samples. We conduct extensive experiments on representative multilingual model series to compare performance across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA), which is adept at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training [119.03392147066093]
Recent autoregressive vision-language models have excelled in few-shot text generation tasks but face challenges in alignment tasks.
We introduce a contrastive loss into text generation models, partitioning the language model into components dedicated to unimodal text processing and to multimodal data handling.
To bridge this gap, this work introduces VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions.
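The combination this summary describes, an autoregressive text loss plus a contrastive alignment term, can be sketched as a weighted sum. The weighting, temperature, and toy shapes below are assumptions, not the paper's configuration.

```python
# Sketch: total loss = language-modeling loss + alpha * contrastive term.
import torch
import torch.nn.functional as F

def cosmo_style_loss(lm_logits, target_ids, img_emb, txt_emb, alpha=0.5):
    # Autoregressive objective over the text tokens.
    lm = F.cross_entropy(lm_logits.flatten(0, 1), target_ids.flatten())
    # Contrastive alignment between pooled image and text representations.
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()
    contrast = F.cross_entropy(sim / 0.07, torch.arange(len(img_emb)))
    return lm + alpha * contrast

loss = cosmo_style_loss(torch.randn(2, 5, 100), torch.randint(0, 100, (2, 5)),
                        torch.randn(2, 64), torch.randn(2, 64))
print(loss)
```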
arXiv Detail & Related papers (2024-01-01T18:58:42Z)
- OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents [122.55393759474181]
We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents.
We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content.
We train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks.
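Document-level filtering rules for interleaved corpora of this kind typically score the balance between text and images. A sketch with assumed thresholds follows; these are not OBELICS's published rules.

```python
# Sketch of a keep/drop rule for interleaved image-text documents.
from dataclasses import dataclass

@dataclass
class InterleavedDoc:
    texts: list[str]       # text segments, in document order
    image_urls: list[str]  # images interleaved between them

def keep_interleaved(doc: InterleavedDoc,
                     max_images: int = 30,
                     min_words_per_image: int = 20) -> bool:
    n_words = sum(len(t.split()) for t in doc.texts)
    n_imgs = len(doc.image_urls)
    if n_imgs == 0 or n_imgs > max_images:           # need images, but not spam
        return False
    return n_words / n_imgs >= min_words_per_image   # enough surrounding text

doc = InterleavedDoc(texts=["lorem ipsum " * 30], image_urls=["a.png"])
print(keep_interleaved(doc))  # True
```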
arXiv Detail & Related papers (2023-06-21T14:01:01Z)