Instruction Makes a Difference
- URL: http://arxiv.org/abs/2402.00453v2
- Date: Thu, 13 Jun 2024 13:28:37 GMT
- Title: Instruction Makes a Difference
- Authors: Tosin Adewumi, Nudrat Habib, Lama Alkhaled, Elisa Barney
- Abstract summary: We introduce the Instruction Document Visual Question Answering (iDocVQA) dataset and the Large Language Document (LLaDoc) model.
We compare performance across document-related datasets using the recent state-of-the-art (SotA) Large Language and Vision Assistant (LLaVA) 1.5 as the base model.
We also evaluate the performance of the derived models for object hallucination using the Polling-based Object Probing Evaluation (POPE) dataset.
- Score: 0.22499166814992438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the Instruction Document Visual Question Answering (iDocVQA) dataset and the Large Language Document (LLaDoc) model, for training Language-Vision (LV) models for document analysis and predictions on document images, respectively. Usually, deep neural networks for the DocVQA task are trained on datasets lacking instructions. We show that using instruction-following datasets improves performance. We compare performance across document-related datasets using the recent state-of-the-art (SotA) Large Language and Vision Assistant (LLaVA) 1.5 as the base model. We also evaluate the performance of the derived models for object hallucination using the Polling-based Object Probing Evaluation (POPE) dataset. The results show that instruction-tuning performance ranges from 11X to 32X of zero-shot performance and from 0.1% to 4.2% over non-instruction (traditional task) finetuning. Despite the gains, these still fall short of human performance (94.36%), implying there is still much room for improvement.
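As a rough illustration of what the abstract describes, the sketch below wraps a plain document-QA pair in a LLaVA-style instruction-following conversation record and scores a POPE-style yes/no polling evaluation. The field names, prompt wording, file path, and example data are illustrative assumptions and do not reproduce the actual iDocVQA or POPE formats.

```python
# Minimal sketch; field names, prompt wording, and example data are assumptions,
# not the actual iDocVQA or POPE formats.

def to_instruction_record(doc_id, image_path, question, answer):
    """Wrap a plain DocVQA-style QA pair in a LLaVA-style conversation record."""
    return {
        "id": doc_id,
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": "<image>\nAnswer the question using the document shown. "
                      f"Question: {question}"},
            {"from": "gpt", "value": answer},
        ],
    }

def pope_style_accuracy(predictions, labels):
    """POPE-style polling: each item asks whether an object is present in the
    image; the model answers yes/no and we score plain binary accuracy."""
    normalize = lambda s: "yes" if s.strip().lower().startswith("yes") else "no"
    correct = sum(normalize(p) == l for p, l in zip(predictions, labels))
    return correct / len(labels)

if __name__ == "__main__":
    record = to_instruction_record(
        doc_id="sample-0001",
        image_path="images/invoice_0001.png",  # hypothetical path
        question="What is the total amount due?",
        answer="$1,250.00",
    )
    print(record["conversations"][0]["value"])
    print(pope_style_accuracy(["Yes, it is.", "no", "yes"], ["yes", "no", "no"]))
```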
Related papers
- Small Language Model as Data Prospector for Large Language Model [22.659698878699032]
NUGGETS identifies and selects high-quality data from a large dataset.
SuperNUGGETS uses a small language model (SLM) instead of a large language model (LLM) to filter the data for outstanding one-shot instances.
The experimental results show that the performance of SuperNUGGETS only decreases by 1-2% compared to NUGGETS, but the efficiency can be increased by a factor of 58.
arXiv Detail & Related papers (2024-12-13T09:23:58Z) - HRVDA: High-Resolution Visual Document Assistant [32.51417315241559]
We propose a High-Resolution Visual Document Assistant (HRVDA) to bridge the gap between MLLMs and visual document understanding.
HRVDA employs a content filtering mechanism and an instruction filtering module to filter out the content-agnostic visual tokens and instruction-agnostic visual tokens.
Our model achieves state-of-the-art performance across multiple document understanding datasets.
arXiv Detail & Related papers (2024-04-10T11:10:50Z) - Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose TIVE, a high-value data selection approach, to eliminate redundancy within visual instruction data and reduce the training cost.
Using only about 15% of the data, our approach achieves average performance comparable to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z) - Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks [17.367599062853156]
Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets.
We propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models (a rough positive/negative prompt sketch appears after this list).
arXiv Detail & Related papers (2023-07-13T15:05:34Z) - Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [92.85265959892115]
This paper introduces the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers.
To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts.
arXiv Detail & Related papers (2023-06-26T10:26:33Z) - How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z) - Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning [37.01833561948585]
Recent work on instruction tuning (IT) has achieved strong performance and zero-shot generalizability to unseen tasks.
We analyze how models utilize instructions during IT by comparing model training with altered vs. original instructions.
arXiv Detail & Related papers (2023-05-19T02:00:47Z) - Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for NLP classification tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - Test-Time Adaptation for Visual Document Understanding [34.79168501080629]
DocTTA is a novel test-time adaptation method for documents.
It performs source-free domain adaptation using unlabeled target document data.
We introduce new benchmarks using existing public datasets for various VDU tasks.
arXiv Detail & Related papers (2022-06-15T01:57:12Z) - ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
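The positive/negative prompt formulation mentioned in the "Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks" entry above can be sketched roughly as scoring an image against a paired positive and negative prompt with CLIP. The checkpoint, prompt wording, and blank placeholder image below are illustrative assumptions, not the paper's actual training recipe.

```python
# Rough sketch of positive/negative prompt scoring with CLIP. The checkpoint,
# prompt wording, and placeholder image are assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"            # assumed checkpoint
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.new("RGB", (224, 224), color="white")    # placeholder image
prompts = [
    "a photo of a scratch on the product surface",      # positive prompt (assumed wording)
    "a photo of a product surface without any scratch", # negative prompt (assumed wording)
]

with torch.no_grad():
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image           # shape: (1, 2)
    probs = logits.softmax(dim=-1)

# Pairing each fine-grained attribute with a positive and a negative prompt turns
# open-ended image-text matching into a binary decision that can be supervised
# or thresholded per attribute.
print({"positive": probs[0, 0].item(), "negative": probs[0, 1].item()})
```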
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.