Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
- URL: http://arxiv.org/abs/2410.08695v2
- Date: Tue, 5 Nov 2024 03:56:21 GMT
- Title: Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
- Authors: Yue Yang, Shuibai Zhang, Wenqi Shao, Kaipeng Zhang, Yi Bin, Yu Wang, Ping Luo
- Abstract summary: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks.
We introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB).
VLB provides a robust and comprehensive assessment for LVLMs with reduced data contamination and flexible complexity.
- Score: 45.584695790489484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to strong performance on various multimodal evaluation benchmarks. However, these benchmarks are static and overlap with the models' pre-training data, which imposes fixed complexity constraints and introduces data contamination, raising concerns about the validity of the evaluation. To address these two challenges, we introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment of LVLMs with reduced data contamination and flexible complexity. To this end, VLB dynamically generates new visual question-answering samples through a multimodal bootstrapping module that modifies both images and language, while a judge module ensures that newly generated samples remain consistent with the original ones. By composing various bootstrapping strategies, VLB offers dynamic variants of existing benchmarks with diverse complexities, enabling the evaluation to co-evolve with the ever-evolving capabilities of LVLMs. Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes performance limitations of LVLMs.
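The abstract outlines a concrete loop: bootstrapping strategies modify the image and/or the question, a judge module filters out candidates that drift from the original sample, and the LVLM is then scored on the surviving dynamic variants. The sketch below is a minimal illustration of that loop under stated assumptions; the strategy functions, the judge check, and the `lvlm`/`benchmark` interfaces are hypothetical placeholders, not the authors' released implementation.

```python
# Minimal sketch of the VLB-style evaluation loop described in the abstract.
# All names (strategies, judge, LVLM interface) are illustrative assumptions.
import random
from dataclasses import dataclass


@dataclass
class VQASample:
    image: object  # e.g., a PIL.Image
    question: str
    answer: str


def paraphrase_question(sample: VQASample) -> VQASample:
    """Language bootstrapping: rewrite the question (placeholder)."""
    return VQASample(sample.image, f"Rephrased: {sample.question}", sample.answer)


def perturb_image(sample: VQASample) -> VQASample:
    """Image bootstrapping: edit the image, e.g., add/remove objects (placeholder)."""
    return VQASample(sample.image, sample.question, sample.answer)


BOOTSTRAP_STRATEGIES = [paraphrase_question, perturb_image]


def judge_consistent(original: VQASample, candidate: VQASample) -> bool:
    """Judge module: keep only candidates still consistent with the
    original sample (placeholder check)."""
    return candidate.answer == original.answer


def vlb_evaluate(lvlm, benchmark, n_strategies: int = 2) -> float:
    """Dynamically bootstrap each benchmark sample, then score the LVLM."""
    correct, total = 0, 0
    for sample in benchmark:
        candidate = sample
        for strategy in random.sample(BOOTSTRAP_STRATEGIES, k=n_strategies):
            candidate = strategy(candidate)
        if not judge_consistent(sample, candidate):
            continue  # discard samples the judge rejects
        prediction = lvlm(candidate.image, candidate.question)
        correct += int(prediction.strip() == candidate.answer)
        total += 1
    return correct / max(total, 1)
```

Because the strategies compose, varying `n_strategies` (or the strategy pool) would yield benchmark variants of different complexity, which is the "flexible complexity" the paper emphasizes.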
Related papers
- PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving [50.50405233978406]
We propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG).
OVPG aims to generate fresh, diverse, and verifiable evaluation data automatically in puzzle-solving tasks.
Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples.
arXiv Detail & Related papers (2025-04-15T05:29:31Z) - VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering [8.21219588747224]
This paper introduces Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a vision encoder with a sequence-to-sequence language model.
VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space.
Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2025-04-11T05:51:44Z) - Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation [61.64052577026623]
Real-world multi-view datasets are often heterogeneous and imperfect.
We propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment.
In experiments, we employ it in unsupervised multi-view clustering, noise-label classification, and as a plug-and-play module for cross-modal hashing retrieval.
arXiv Detail & Related papers (2025-03-06T07:01:08Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five user-demanded evaluation capabilities, the framework shows its effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [48.455597568212944]
We present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure.
In particular, its Endogenous Visual Pre-training (EViP) is designed as a progressive learning process for the visual experts, aiming to fully exploit visual knowledge from noisy data through to high-quality data.
arXiv Detail & Related papers (2024-10-10T17:59:22Z) - Towards Robust Multimodal Sentiment Analysis with Incomplete Data [20.75292807497547]
We present an innovative Language-dominated Noise-resistant Learning Network (LNLN) to achieve robust Multimodal Sentiment Analysis (MSA).
LNLN features a dominant modality correction (DMC) module and a dominant modality based multimodal learning (DMML) module, which enhance the model's robustness across various noise scenarios.
arXiv Detail & Related papers (2024-09-30T07:14:31Z) - Multiview learning with twin parametric margin SVM [0.0]
Multiview learning (MVL) seeks to leverage diverse perspectives of the data so that they complement each other.
We propose a multiview twin parametric margin support vector machine (MvTPMSVM).
MvTPMSVM constructs parametric margin hyperplanes corresponding to both classes, aiming to regulate and manage the impact of the heteroscedastic noise structure.
arXiv Detail & Related papers (2024-08-04T10:16:11Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z) - ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks [76.25209974199274]
Large vision-language models (LVLMs) exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning.
Our benchmark and evaluation framework will be open-sourced as a cornerstone for advancing the development of LVLMs.
arXiv Detail & Related papers (2023-10-04T04:07:37Z)