Reconstruction as a Bridge for Event-Based Visual Question Answering
- URL: http://arxiv.org/abs/2512.11510v1
- Date: Fri, 12 Dec 2025 12:16:45 GMT
- Title: Reconstruction as a Bridge for Event-Based Visual Question Answering
- Authors: Hanyue Lou, Jiayi Zhou, Yang Zhang, Boyu Li, Yi Wang, Guangnan Ye, Boxin Shi,
- Abstract summary: We propose a straightforward Frame-based Reconstruction and Tokenization (FRT) method and an efficient Adaptive Reconstruction and Tokenization (ART) method.<n>For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets.<n>Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
- Score: 50.47820990163094
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
Related papers
- Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization.<n>We introduce an EOS-based reconstruction task, encouraging the model to reconstruct input from the corresponding EOS> embeddings.<n>Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z) - EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition [54.55914886780534]
Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion.<n>We introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR.<n>EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios.
arXiv Detail & Related papers (2026-02-13T13:25:05Z) - CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval [54.15776146365823]
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text.<n>We propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components.
arXiv Detail & Related papers (2026-01-07T09:21:38Z) - Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning.<n>We present RSCoVLM, a simple yet flexible VLM baseline for RS MTL.<n>We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z) - CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text.<n>CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z) - New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration [49.180693704510006]
Referring Expression (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding.<n>It serves as an essential testing ground for Multimodal Large Language Models (MLLMs)
arXiv Detail & Related papers (2025-02-27T13:58:44Z) - LLM-EvRep: Learning an LLM-Compatible Event Representation Using a Self-Supervised Framework [11.30784253260618]
Large language models (LLMs) have exhibited remarkable zero-shot capabilities across diverse domains.<n>We propose textbfLLM-EvGen, an event representation generator that produces event representations textbfLLM-EvRep<n> Comprehensive experiments were conducted on three datasets: N-ImageNet, N-Caltech101, and N-MNIST.
arXiv Detail & Related papers (2025-02-20T05:18:36Z) - Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization [40.77611907215627]
Large Vision Language Models (VLMs) are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies.<n>We introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset.<n>We also introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning.
arXiv Detail & Related papers (2025-02-18T18:59:57Z) - u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.