Harnessing PDF Data for Improving Japanese Large Multimodal Models
- URL: http://arxiv.org/abs/2502.14778v2
- Date: Sat, 31 May 2025 10:33:53 GMT
- Title: Harnessing PDF Data for Improving Japanese Large Multimodal Models
- Authors: Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa
- Abstract summary: Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs.
- Score: 56.80385809059738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 2.1% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs.
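The extraction pipeline the abstract describes (layout analysis, OCR, then vision-language pairing) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the block types, pairing heuristic, and helper names are all hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Region:
    kind: str      # "figure" or "text" (hypothetical block types)
    page: int
    content: str   # image path for figures, OCR'd text for text blocks

def analyze_layout(pdf_pages):
    """Stand-in for a layout-analysis + OCR stage: each page is assumed
    to already be a list of typed blocks produced by pretrained models."""
    return [Region(kind=b["type"], page=b["page"], content=b["data"])
            for page in pdf_pages for b in page]

def pair_figures_with_text(regions, max_distance=1):
    """Toy vision-language pairing: match each figure with text blocks
    on the same or an adjacent page (a real pipeline would use a
    vision-language model to score candidate pairs)."""
    figures = [r for r in regions if r.kind == "figure"]
    texts = [r for r in regions if r.kind == "text"]
    pairs = []
    for fig in figures:
        nearby = [t.content for t in texts
                  if abs(t.page - fig.page) <= max_distance]
        if nearby:
            pairs.append({"image": fig.content, "text": " ".join(nearby)})
    return pairs
```

The resulting image-text pairs would then feed instruction-data construction, as the abstract notes; the proximity heuristic here is purely illustrative.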
Related papers
- Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework [38.98519875112922]
Vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. We reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations.
arXiv Detail & Related papers (2026-02-15T09:54:40Z) - Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation [3.3393607383304253]
We develop a framework that screens source sentences to form efficient parallel text. We find that by training mostly on complex sentences from both existing and synthetic datasets, our method significantly improves translation quality. This approach not only reduces MT systems' training cost by reducing the training data requirement, but also showcases LALITA's utility in data augmentation.
arXiv Detail & Related papers (2026-01-13T15:05:19Z) - DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering [42.08511799479111]
This work addresses the scarcity of high-quality, large-scale resources for Japanese Vision-and-Language (V&L) modeling. We present a scalable and reproducible pipeline that integrates large-scale web collection with rigorous filtering/deduplication, object-detection-driven evidence extraction, and Large Language Model (LLM)-based refinement.
arXiv Detail & Related papers (2025-11-30T08:09:43Z) - WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models [29.864478753087138]
WAON is a large-scale and high-quality Japanese image-text pair dataset. To evaluate its effectiveness, we construct WAON-Bench, a benchmark for Japanese cultural image classification. We fine-tune SigLIP2, a strong multilingual model, on both datasets.
arXiv Detail & Related papers (2025-10-25T12:42:42Z) - FuxiMT: Sparsifying Large Language Models for Chinese-Centric Multilingual Machine Translation [43.26446958873554]
We present FuxiMT, a novel Chinese-centric multilingual machine translation model powered by a sparsified large language model (LLM). FuxiMT incorporates Mixture-of-Experts (MoEs) and employs a curriculum learning strategy for robust performance across various resource levels.
arXiv Detail & Related papers (2025-05-20T12:09:17Z) - Data Augmentation With Back translation for Low Resource languages: A case of English and Luganda [0.0]
We explore the application of back translation as a semi-supervised technique to enhance Neural Machine Translation models for the English-Luganda language pair. Our methodology involves developing custom NMT models using both publicly available and web-crawled data, and applying Iterative and Incremental Back translation techniques. The results of our study show significant improvements, with translation performance for the English-Luganda pair exceeding previous benchmarks by more than 10 BLEU score units across all translation directions.
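The iterative back-translation procedure mentioned above can be sketched as a simple loop; the `translate_t2s` and `train_s2t` callables here are hypothetical placeholders, not the paper's actual NMT models:

```python
def iterative_back_translation(mono_target, translate_t2s, train_s2t, rounds=2):
    """Each round: back-translate monolingual target-language text into
    synthetic source sentences, then (re)train the source->target model
    on the resulting synthetic parallel pairs."""
    model = None
    for _ in range(rounds):
        synthetic_pairs = [(translate_t2s(t), t) for t in mono_target]
        model = train_s2t(synthetic_pairs, init=model)
    return model
```

In the full iterative variant, the reverse (target-to-source) model would also be retrained each round with the improved forward model's output; that refinement is omitted here for brevity.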
arXiv Detail & Related papers (2025-05-05T08:47:52Z) - Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance [0.0]
We present our latest Hindi-English bilingual LLM Mantra-14B, with a 3% average improvement in benchmark scores over both languages.
We instruction tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi.
Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead.
arXiv Detail & Related papers (2025-04-13T23:10:13Z) - Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu [53.437954702561065]
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT. This study systematically investigates how each resource and its quality affects the translation performance, with the Manchu language. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help.
arXiv Detail & Related papers (2025-02-17T14:53:49Z) - PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z) - How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes [2.0109318570325847]
We investigate the impact of fine-tuning the Llama 3 model using TMs from a specific organisation in the software sector.
We fine-tune separate models for each training set and evaluate their performance based on automatic metrics: BLEU, chrF++, TER, and COMET.
Our findings reveal improvement in translation performance with larger datasets across all metrics.
arXiv Detail & Related papers (2024-09-05T12:06:38Z) - Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, adapts a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z) - Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis [3.16714407449467]
We investigate the role of translation and synthetic data in training language models.
We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model.
To rectify quality issues in the translated data, we pre-train the models with a small dataset of synthesized high-quality Arabic stories.
arXiv Detail & Related papers (2024-05-23T07:53:04Z) - Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation [12.86275938443485]
We examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance.
We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios.
We extend the vocabulary of LLaMA - the model with the closest open-source performance to proprietary language models like GPT-3.
arXiv Detail & Related papers (2023-04-16T18:37:39Z) - Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z) - ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language. It supports both statistical and neural translation models, and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z) - Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.