Related papers: TextSquare: Scaling up Text-Centric Visual Instruction Tuning

URL: http://arxiv.org/abs/2404.12803v1
Date: Fri, 19 Apr 2024 11:38:08 GMT
Title: TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Authors: Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang
Abstract summary: We introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M. Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs. It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks.
Score: 64.55339431760727
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.

Related papers

TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning [21.738227405440785]
Existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ. We construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality.
arXiv Detail & Related papers (2026-03-03T15:17:56Z)
GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation [31.365285503503475]
We present a framework for learning spatial reasoning using 2D boxes from standard detectors. We show that when trained on GRAID data, models learn spatial reasoning concepts that generalize on over-detailed held-out types. We also show that, when trained on all question types, models achieve improvements on several existing benchmarks.
arXiv Detail & Related papers (2025-10-25T02:07:23Z)
MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes [60.57770396565211]
We show that strong reasoning abilities can emerge with far less data. MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B.
arXiv Detail & Related papers (2025-09-29T15:43:59Z)
Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text [30.74255946385862]
We introduce Text2Vis, a benchmark designed to assess text-to-visualization models. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. It reveals significant performance gaps, highlighting key challenges, and offering insights for future advancements.
arXiv Detail & Related papers (2025-07-26T14:59:04Z)
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
RedPajama: an Open Dataset for Training Large Language Models [80.74772646989423]
We identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. We release RedPajama-V1, an open reproduction of the LLaMA training dataset, and RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
arXiv Detail & Related papers (2024-11-19T09:35:28Z)
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model and the strongest open-source model.
arXiv Detail & Related papers (2024-06-26T17:50:11Z)
Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models. We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document [60.01330653769726]
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions. By expanding its capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability.
arXiv Detail & Related papers (2024-03-07T13:16:24Z)
Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models [7.056824589733873]
Multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production. Current MLLMs trained with visual-question-answering datasets could suffer from degradation. We propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts MLLM's language capability after visual instruction tuning.
arXiv Detail & Related papers (2024-02-16T18:42:08Z)
SVIT: Scaling up Visual Instruction Tuning [26.794950789335402]
We build a dataset of 4.2 million visual instruction tuning data including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs and 106K detailed image descriptions. Experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks.
arXiv Detail & Related papers (2023-07-09T03:25:14Z)
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations [91.98516412612739]
We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat. Our objective is to capture the breadth of interactions that a human might have with an AI assistant. We fine-tune a LLaMA model to create a powerful conversational model, UltraLLaMA.
arXiv Detail & Related papers (2023-05-23T16:49:14Z)
Prefix Language Models are Unified Modal Learners [30.666873206462295]
We show that a unified modal model could be learned with a prefix language modeling objective upon text and image sequences. Thanks to the simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks.
arXiv Detail & Related papers (2022-06-15T17:49:38Z)
Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To Benchmark [14.50261153230204]
We focus on Multimodal Machine Reading Comprehension (M3C), where a model is expected to answer questions based on a given passage (or context). We identify three critical biases stemming from the question-answer generation process and memorization capabilities of large deep models. We propose a systematic framework to address these biases through three Control-Knobs.
arXiv Detail & Related papers (2021-10-22T16:33:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.