WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models
- URL: http://arxiv.org/abs/2510.22276v1
- Date: Sat, 25 Oct 2025 12:42:42 GMT
- Title: WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models
- Authors: Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe, Naoaki Okazaki
- Abstract summary: WAON is a large-scale and high-quality Japanese image-text pair dataset. To evaluate its effectiveness, we construct WAON-Bench, a benchmark for Japanese cultural image classification. We fine-tune SigLIP2, a strong multilingual model, on both datasets.
- Score: 29.864478753087138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale and high-quality image-text pair datasets play an important role in developing high-performing Vision-Language Models (VLMs). In this work, we introduce WAON, a large-scale and high-quality Japanese image-text pair dataset containing approximately 155 million examples, collected from Common Crawl. Our dataset construction pipeline employs various techniques, including filtering and deduplication, which have been shown to be effective in previous studies. To evaluate its effectiveness, we also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification, consisting of 374 classes. We then conduct experiments using both WAON and the Japanese subset of ReLAION, one of the most widely used vision-language datasets, fine-tuning SigLIP2, a strong multilingual model, on each. The results demonstrate that WAON enhances model performance on WAON-Bench more efficiently than ReLAION and achieves higher accuracy across all evaluated benchmarks. Furthermore, the model fine-tuned on WAON achieves state-of-the-art performance on several Japanese cultural benchmarks. We release our dataset, model, and code at https://speed1313.github.io/WAON.
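The construction pipeline above applies filtering and deduplication to web-crawled image-text pairs. The following minimal Python sketch is an illustration only, not the released WAON pipeline: it shows an assumed kana-based language filter and exact-duplicate removal by image hash, with function names and thresholds chosen purely for demonstration.

```python
# Illustrative sketch of a filtering/deduplication pass over image-text pairs.
# NOT the WAON pipeline: the language check, hashing strategy, and thresholds
# are assumptions chosen only to demonstrate the idea.
import hashlib
import re

KANA = re.compile(r"[\u3040-\u30ff]")  # hiragana + katakana, a rough Japanese signal


def looks_japanese(caption: str, min_kana: int = 1) -> bool:
    """Keep captions that contain at least `min_kana` kana characters."""
    return len(KANA.findall(caption)) >= min_kana


def filter_and_dedup(pairs):
    """pairs: iterable of (image_bytes, caption); yields kept pairs."""
    seen = set()
    for image_bytes, caption in pairs:
        caption = caption.strip()
        if not caption or not looks_japanese(caption):
            continue  # drop empty or non-Japanese captions
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in seen:
            continue  # drop exact duplicate images
        seen.add(digest)
        yield image_bytes, caption


# Toy usage: one duplicate image and one non-Japanese caption are removed.
toy = [
    (b"image-bytes-A", "京都のお寺の写真"),
    (b"image-bytes-A", "京都のお寺"),
    (b"image-bytes-B", "a cat photo"),
]
print(len(list(filter_and_dedup(toy))))  # -> 1
```

A production-scale pipeline would typically add URL-level deduplication, near-duplicate detection (e.g., perceptual hashing), and image-text similarity filtering on top of these basic checks.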
Related papers
- DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering [42.08511799479111]
This work addresses the scarcity of high-quality, large-scale resources for Japanese Vision-and-Language (V&L) modeling. We present a scalable and reproducible pipeline that integrates large-scale web collection with rigorous filtering/deduplication, object-detection-driven evidence extraction, and Large Language Model (LLM)-based refinement.
arXiv Detail & Related papers (2025-11-30T08:09:43Z) - Harnessing PDF Data for Improving Japanese Large Multimodal Models [56.80385809059738]
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs.
arXiv Detail & Related papers (2025-02-20T17:59:59Z) - Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models [1.9890559505377343]
Current vision-language multimodal models are well-adapted for general visual understanding tasks. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate it on the benchmark, achieving significant improvements.
arXiv Detail & Related papers (2024-09-14T05:07:57Z) - Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning. Our key idea involves challenging the model to discern both matching and distinct elements. We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z) - Multilingual Diversity Improves Vision-Language Representations [97.16233528393356]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet. On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z) - Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models. Data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z) - Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models [0.09264362806173355]
Large Language and Vision Assistant models (LLVAs) engage users in rich conversational experiences intertwined with image-based queries.
This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts.
Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images.
arXiv Detail & Related papers (2023-12-30T03:19:54Z) - From Base to Conversational: Japanese Instruction Dataset and Tuning Large Language Models [6.520584613661788]
We construct a Japanese instruction dataset by expanding and filtering existing datasets.
We perform Low-Rank Adaptation (LoRA) tuning on existing Japanese and English models (a minimal LoRA setup sketch appears after this list).
arXiv Detail & Related papers (2023-09-07T00:14:37Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning that produces effective models, outperforming bigger and more expensive ones often trained on orders-of-magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms much larger state-of-the-art models that use 2-20x more FLOPs and bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z) - LAION-5B: An open large-scale dataset for training next generation image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English-language text (a minimal CLIP-filtering sketch appears after this list).
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)
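The LAION-5B entry above describes CLIP-filtered image-text pairs. Below is a minimal sketch of CLIP-score filtering, assuming the Hugging Face transformers CLIP API; it is not the code used to build LAION-5B, and the checkpoint name and similarity threshold are illustrative assumptions.

```python
# Hedged sketch of CLIP-similarity filtering for image-text pairs.
# The checkpoint and threshold below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)


@torch.no_grad()
def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())


def keep_pair(image: Image.Image, caption: str, threshold: float = 0.28) -> bool:
    """Keep a pair only if its similarity clears the (assumed) threshold."""
    return clip_score(image, caption) > threshold
```

The threshold controls a recall-versus-noise trade-off: a higher value keeps fewer but better-aligned pairs, a lower value keeps more data at the cost of noisier captions.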
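Similarly, the Japanese instruction-tuning entry above mentions Low-Rank Adaptation (LoRA). The sketch below shows how LoRA adapters can be attached to a causal language model with the peft library; the base checkpoint and hyperparameters are assumptions for illustration, not the settings used in that paper.

```python
# Hedged sketch: attaching LoRA adapters to a causal LM with peft.
# The base checkpoint and hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

BASE_ID = "rinna/japanese-gpt-neox-3.6b"  # assumed example of a Japanese base model
model = AutoModelForCausalLM.from_pretrained(BASE_ID)

lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,                       # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projection name in GPT-NeoX-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
# The wrapped model can then be fine-tuned on instruction data with a standard
# transformers Trainer; the frozen base weights receive no gradient updates.
```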
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.