Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
- URL: http://arxiv.org/abs/2509.12876v1
- Date: Tue, 16 Sep 2025 09:29:02 GMT
- Title: Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
- Authors: Fuyu Xing, Zimu Wang, Wei Wang, Haiyang Zhang,
- Abstract summary: We present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks. LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings.
- Score: 9.799586939041644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities.
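The following is a minimal sketch, assuming the Hugging Face `transformers` and `peft` libraries, of what the LoRA fine-tuning setting described in the abstract could look like. The checkpoint name, target module names, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: LoRA adapters on an LVLM backbone for instruction-style event extraction.
# Checkpoint, target modules, and hyperparameters are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "Qwen/Qwen-VL-Chat"  # placeholder LVLM checkpoint; requires trust_remote_code

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Attach low-rank adapters; the base weights stay frozen, so only a small
# fraction of parameters is updated during event-extraction fine-tuning.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # module names depend on the chosen backbone architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Event extraction is then framed as instruction-following generation: the adapter is
# trained on (prompt, gold annotation) pairs, e.g. "List every event in the sentence/image
# with its type, trigger, and arguments.", using a standard causal-LM loss.
```

Few-shot prompting, by contrast, requires no parameter updates: the same instruction is simply prefixed with a handful of annotated demonstrations before the test instance.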
Related papers
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs [24.76767896607915]
Recent research suggests that models may be particularly sensitive to certain semantics in visual inputs, making them prone to errors. Inspired by this, this paper conducts the first such exploration on large vision-language models (LVLMs) and finds that LVLMs are indeed susceptible to hallucinations and various errors when facing specific semantic concepts in images.
arXiv Detail & Related papers (2025-05-21T08:45:43Z)
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives. We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z)
- M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering [21.75002972755496]
Multimodal in-context learning (ICL) equips Large Vision-Language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning. We propose M$^2$IV, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors.
arXiv Detail & Related papers (2025-04-06T22:02:21Z)
- E2LVLM: Evidence-Enhanced Large Vision-Language Model for Multimodal Out-of-Context Misinformation Detection [7.1939657372410375]
We present E2LVLM, a novel evidence-enhanced large vision-language model that adapts textual evidence at two levels. To address the scarcity of news-domain datasets containing both judgments and explanations, we generate a novel OOC multimodal instruction-following dataset. Extensive experiments demonstrate that E2LVLM outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-02-12T04:25:14Z)
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural-language supervision. We propose OLA-VLM, the first approach to distill knowledge into the LLM's hidden representations from a set of target visual representations. OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench. A generic sketch of this style of hidden-state distillation appears after this list.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
- Large Vision-Language Models as Emotion Recognizers in Context Awareness [14.85890824622433]
Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues.
Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images.
This paper systematically explores the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task.
arXiv Detail & Related papers (2024-07-16T01:28:06Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks. However, their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored. We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
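As referenced in the OLA-VLM entry above, the following is a generic, minimal sketch of distilling target visual embeddings into an LLM's hidden states via an auxiliary alignment loss. The pooling choice, cosine objective, and dimensions are illustrative assumptions, not that paper's exact formulation.

```python
# Sketch: auxiliary feature-alignment ("embedding distillation") loss that pulls pooled
# LLM hidden states toward target visual embeddings from a frozen teacher encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateDistiller(nn.Module):
    """Projects pooled LLM hidden states and aligns them with target visual features."""
    def __init__(self, llm_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Linear(llm_dim, target_dim)

    def forward(self, hidden_states: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, llm_dim) from an intermediate LLM layer
        # target_feats:  (batch, target_dim) from a frozen visual teacher encoder
        pooled = hidden_states.mean(dim=1)       # simple mean pooling over tokens
        pred = self.proj(pooled)                 # map into the teacher's embedding space
        return 1.0 - F.cosine_similarity(pred, target_feats, dim=-1).mean()

# Usage: add as an auxiliary term next to the usual next-token prediction loss.
distiller = HiddenStateDistiller(llm_dim=4096, target_dim=1024)
hidden = torch.randn(2, 128, 4096)   # stand-in for LLM layer outputs
teacher = torch.randn(2, 1024)       # stand-in for target visual embeddings
aux_loss = distiller(hidden, teacher)
# total_loss = lm_loss + 0.5 * aux_loss   # the weighting is a free hyperparameter
print(aux_loss.item())
```

The auxiliary weight and the choice of which LLM layers to supervise are design decisions left open in this sketch.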