Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
- URL: http://arxiv.org/abs/2407.13766v2
- Date: Thu, 10 Oct 2024 21:03:15 GMT
- Title: Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
- Authors: Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan,
- Abstract summary: Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images.
The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering.
We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
- Score: 63.296342841358815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album searches or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, "Visual Haystacks (VHs)". We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, as well as exhibit biases based on the placement of key information within the context window. Towards a solution, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU -- far surpassing the 1k-image limit of contemporary models. MIRAGE demonstrates up to 13% performance improvement over existing open-source LMMs on VHs, sets a new state-of-the-art on the RetVQA multi-image QA benchmark, and achieves competitive performance on single-image QA with state-of-the-art LMMs.
Related papers
- AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity [85.44800864697464]
We introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction.
We show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference.
arXiv Detail & Related papers (2024-09-20T10:50:21Z) - LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture [18.459825048813336]
LongLLaVA is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness.
It could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
arXiv Detail & Related papers (2024-09-04T17:25:21Z) - MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of image, including text-rich images.
Current benchmarks fail to accurately reflect performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark in 11 diverse tasks to evaluate LMMs for text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z) - MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models [76.1999277491816]
Multimodal Multi-image Understanding (MMIU) is a comprehensive evaluation suite designed to assess Large Vision-Language Models (LVLMs)
MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions.
Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension.
arXiv Detail & Related papers (2024-08-05T17:56:41Z) - LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models [70.2997884478129]
We introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs.
We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs.
arXiv Detail & Related papers (2024-07-10T17:59:43Z) - Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z) - CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal
Models with Multiple Image Inputs [48.269363759989915]
The research focuses on two aspects: first, image-to-image matching, and second, multi-image-to-text matching.
We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL.
arXiv Detail & Related papers (2024-01-05T00:26:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.