DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering
- URL: http://arxiv.org/abs/2512.00773v1
- Date: Sun, 30 Nov 2025 08:09:43 GMT
- Title: DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering
- Authors: Toshiki Katsube, Taiga Fukuhara, Kenichiro Ando, Yusuke Mukuta, Kohei Uehara, Tatsuya Harada
- Abstract summary: This work addresses the scarcity of high-quality, large-scale resources for Japanese Vision-and-Language (V&L) modeling. We present a scalable and reproducible pipeline that integrates large-scale web collection with rigorous filtering/deduplication, object-detection-driven evidence extraction, and Large Language Model (LLM)-based refinement.
- Score: 42.08511799479111
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work addresses the scarcity of high-quality, large-scale resources for Japanese Vision-and-Language (V&L) modeling. We present a scalable and reproducible pipeline that integrates large-scale web collection with rigorous filtering/deduplication, object-detection-driven evidence extraction, and Large Language Model (LLM)-based refinement under grounding constraints. Using this pipeline, we build two resources: an image-caption dataset (DEJIMA-Cap) and a VQA dataset (DEJIMA-VQA), each containing 3.88M image-text pairs, far exceeding the size of existing Japanese V&L datasets. Human evaluations demonstrate that DEJIMA achieves substantially higher Japaneseness and linguistic naturalness than datasets constructed via translation or manual annotation, while maintaining factual correctness at a level comparable to human-annotated corpora. Quantitative analyses of image feature distributions further confirm that DEJIMA broadly covers diverse visual domains characteristic of Japan, complementing its linguistic and cultural representativeness. Models trained on DEJIMA exhibit consistent improvements across multiple Japanese multimodal benchmarks, confirming that culturally grounded, large-scale resources play a key role in enhancing model performance. All data sources and modules in our pipeline are licensed for commercial use, and we publicly release the resulting dataset and metadata to encourage further research and industrial applications in Japanese V&L modeling.
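The abstract describes a four-stage curation pipeline (web collection, filtering/deduplication, object-detection-driven evidence extraction, and LLM-based refinement under grounding constraints) but does not name the specific detector, LLM, or deduplication scheme used. The sketch below is a minimal, hypothetical illustration of how such stages could be chained; `Sample`, `deduplicate`, `detect_objects`, and `refine_with_llm` are illustrative placeholders rather than the authors' implementation, and the SHA-256 exact-duplicate check merely stands in for whatever filtering DEJIMA actually applies. The grounding constraint is represented only by restricting the refinement prompt to detector evidence plus the original web text.

```python
"""Hypothetical sketch of a DEJIMA-style curation pipeline.

Stage names mirror the abstract (collection -> filtering/deduplication ->
object-detection evidence -> LLM refinement under grounding constraints);
every function body is a placeholder, not the paper's actual implementation.
"""
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass
class Sample:
    image_bytes: bytes
    raw_text: str                      # alt-text / surrounding text from the web page
    objects: list[str] | None = None   # detector evidence, filled in later
    caption: str | None = None         # refined Japanese caption


def deduplicate(samples: list[Sample]) -> list[Sample]:
    """Drop exact duplicates via a content hash (a stand-in for the
    filtering/deduplication stage, which the abstract does not specify)."""
    seen: set[str] = set()
    kept: list[Sample] = []
    for s in samples:
        key = hashlib.sha256(s.image_bytes).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def detect_objects(sample: Sample) -> list[str]:
    """Placeholder for object-detection-driven evidence extraction;
    a real pipeline would call a pretrained detector on the image here."""
    return ["object_1", "object_2"]


def refine_with_llm(sample: Sample) -> str:
    """Placeholder for LLM-based refinement. The grounding constraint is
    expressed by building the prompt only from detector evidence and the
    original web text, so unsupported details are not invited."""
    prompt = (
        "次の情報だけに基づいて自然な日本語キャプションを書いてください。\n"
        f"検出物体: {', '.join(sample.objects or [])}\n"
        f"元テキスト: {sample.raw_text}"
    )
    return prompt  # a real pipeline would send this prompt to an LLM


def build_dataset(raw: list[Sample]) -> list[Sample]:
    """Chain the stages: deduplicate, attach evidence, refine captions."""
    curated = deduplicate(raw)
    for s in curated:
        s.objects = detect_objects(s)
        s.caption = refine_with_llm(s)
    return curated
```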
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z)
- DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset [22.47012356405577]
We propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. DanQing is curated through a more rigorous selection process, yielding superior data quality. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model.
arXiv Detail & Related papers (2026-01-15T11:28:58Z)
- WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models [29.864478753087138]
WAON is a large-scale and high-quality Japanese image-text pair dataset. To evaluate its effectiveness, we construct WAON-Bench, a benchmark for Japanese cultural image classification. We fine-tune SigLIP2, a strong multilingual model, on both datasets.
arXiv Detail & Related papers (2025-10-25T12:42:42Z)
- Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality [74.59049806800176]
This demo paper highlights the Tevatron toolkit's key features, bridging academia and industry. We showcase a unified dense retriever achieving strong multilingual and multimodal effectiveness. We also release OmniEmbed, to the best of our knowledge, the first embedding model that unifies text, image document, video, and audio retrieval.
arXiv Detail & Related papers (2025-05-05T08:52:49Z)
- Harnessing PDF Data for Improving Japanese Large Multimodal Models [56.80385809059738]
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs.
arXiv Detail & Related papers (2025-02-20T17:59:59Z)
- Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model [30.055297898544648]
We take Japanese as a non-English language and propose a method for rapidly creating Japanese multimodal datasets from scratch.
We collect Japanese image-text pairs and interleaved data from web archives and generate Japanese instruction data directly from images using an existing VLM.
Our experimental results show that a VLM trained on these native datasets outperforms those relying on machine-translated content.
arXiv Detail & Related papers (2024-10-30T06:46:33Z)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning. Our key idea involves challenging the model to discern both matching and distinct elements. We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models [69.96148259273065]
"Wan Juan" is a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources.
It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale.
arXiv Detail & Related papers (2023-08-21T14:40:48Z)
- A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation [13.426403221815063]
This paper surveys the landscape of language-and-vision pre-training from the lens of multimodal machine translation.
We summarize the common architectures, pre-training objectives, and datasets from literature and conjecture what further is needed to make progress on multimodal machine translation.
arXiv Detail & Related papers (2023-06-12T15:56:10Z)