Related papers: All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark

URL: http://arxiv.org/abs/2502.16989v2
Date: Fri, 30 May 2025 13:57:45 GMT
Title: All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark
Authors: Davide Testa, Giovanni Bonetta, Raffaella Bernardi, Alessandro Bondielli, Alessandro Lenci, Alessio Miaschi, Lucia Passaro, Bernardo Magnini
Abstract summary: MAIA is a benchmark for fine-grained investigation of the reasoning abilities of visual language models on videos. It considers twelve categories that aim to disentangle language and vision relations by highlighting the role of the visual input. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos.
Score: 74.4821011648997
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully thought-out design, it evaluates VLMs' consistency and visually grounded natural language comprehension and generation simultaneously through an aggregated metric, revealing low results that highlight the models' fragility. Last but not least, the video collection has been carefully selected to reflect Italian culture, and the language data are produced by native speakers.

Related papers

TowerVision: Understanding and Improving Multilinguality in Vision-Language Models [56.775118098058506]
TowerVision is a family of open multilingual vision-language models for both image-text and video-text tasks. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches. To support further research, we publicly release all models, data, and training recipes.
arXiv Detail & Related papers (2025-10-22T17:02:48Z)
Visual Representations inside the Language Model [36.35124375782294]
We study the flow of visual information through the language model, finding that image value tokens encode sufficient information to perform several perception-heavy tasks. We find that while the language model does augment the visual information received from the projection of input visual encodings, it contains less visual information on several tasks than the equivalent visual encoder (SigLIP). Next, we discuss controlling visual information in the language model, showing that adding a text prefix to the image input improves perception capabilities of visual representations.
arXiv Detail & Related papers (2025-10-06T14:01:39Z)
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z)
VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs [37.52094200472755]
This paper reveals a largely under-explored problem from existing video-involved LVLMs - language bias. We first collect a Video Language Bias Evaluation Benchmark, which is specifically designed to assess the language bias in video-involved LVLMs. We also propose Multi-branch Contrastive Decoding (MCD), introducing two expert branches to simultaneously counteract language bias.
arXiv Detail & Related papers (2025-02-23T15:04:23Z)
VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both vision perception and answer generation process. VideoDistill generates answers only from question-related visual embeddings. We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z)
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
MIMIC-IT: Multi-Modal In-Context Instruction Tuning [44.879418596312554]
We present a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Using the MIMIC-IT dataset, it has been observed that Otter demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning. We release the MIMIC-IT dataset, instruction-response collection pipeline, benchmarks, and the Otter model.
arXiv Detail & Related papers (2023-06-08T17:59:56Z)
Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of keyframes. We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought. We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
Is Multimodal Vision Supervision Beneficial to Language? [2.216702991322677]
Vision (image and video) pre-training is the recent popular paradigm that achieved state-of-the-art results on multi-modal tasks. We compare the performance of language representations of stand-alone text encoders of these models to the language representations of text encoders learnt through vision supervision.
arXiv Detail & Related papers (2023-02-10T02:22:44Z)
Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models. Our empirical observations suggest that vision-and-language models are better at label prediction tasks. We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense [98.70218717851665]
It is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources. We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge. We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z)
Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework [19.031957183047048]
We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests. We show that both a multimodal model and a strong language model have a large gap with human performance.
arXiv Detail & Related papers (2021-04-09T04:00:10Z)
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning. We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM). Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture. Our models learn to generate labels in text based on the visual and textual inputs. Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)
Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation. Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.