Related papers: MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance

URL: http://arxiv.org/abs/2509.09730v1
Date: Wed, 10 Sep 2025 12:07:34 GMT
Title: MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance
Authors: Kaikai Zhao, Zhaoxiang Liu, Peng Wang, Xin Wang, Zhicheng Ma, Yajun Xu, Wenjing Zhang, Yibing Nan, Kai Wang, Shiguo Lian
Abstract summary: We introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset specifically designed for ITS. MITS includes 170,400 independently collected real-world ITS images sourced from traffic surveillance cameras. We generate high-quality image captions and 5 million instruction-following visual question-answer pairs, addressing five critical ITS tasks.
Score: 10.956987319921112
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: General-domain large multimodal models (LMMs) have achieved significant advances in various image-text tasks. However, their performance in the Intelligent Traffic Surveillance (ITS) domain remains limited due to the absence of dedicated multimodal datasets. To address this gap, we introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset specifically designed for ITS. MITS includes 170,400 independently collected real-world ITS images sourced from traffic surveillance cameras, annotated with eight main categories and 24 subcategories of ITS-specific objects and events under diverse environmental conditions. Additionally, through a systematic data generation pipeline, we generate high-quality image captions and 5 million instruction-following visual question-answer pairs, addressing five critical ITS tasks: object and event recognition, object counting, object localization, background analysis, and event reasoning. To demonstrate MITS's effectiveness, we fine-tune mainstream LMMs on this dataset, enabling the development of ITS-specific applications. Experimental results show that MITS significantly improves LMM performance in ITS applications, increasing LLaVA-1.5's performance from 0.494 to 0.905 (+83.2%), LLaVA-1.6's from 0.678 to 0.921 (+35.8%), Qwen2-VL's from 0.584 to 0.926 (+58.6%), and Qwen2.5-VL's from 0.732 to 0.930 (+27.0%). We release the dataset, code, and models as open-source, providing high-value resources to advance both ITS and LMM research.
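
The percentage gains quoted in the abstract appear to be relative improvements over each model's baseline score, i.e. (after - before) / before. The following is a minimal sketch that recomputes them from the before/after scores given above; the helper name `relative_gain` and the dictionary layout are illustrative assumptions, not part of the MITS code release.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a percentage."""
    return (after - before) / before * 100.0


# Scores copied from the abstract: (before fine-tuning, after fine-tuning).
reported = {
    "LLaVA-1.5": (0.494, 0.905),
    "LLaVA-1.6": (0.678, 0.921),
    "Qwen2-VL": (0.584, 0.926),
    "Qwen2.5-VL": (0.732, 0.930),
}

# Round to one decimal place, matching the precision used in the abstract.
gains = {name: round(relative_gain(b, a), 1) for name, (b, a) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

Under this reading, the recomputed values match the abstract's figures (+83.2%, +35.8%, +58.6%, +27.0%), which suggests the gains are relative rather than absolute differences in score.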

Related papers

Advanced Data Collection Techniques in Cloud Security: A Multi-Modal Deep Learning Autoencoder Approach [0.0]
This research presents an innovative approach to cloud security that integrates numerous data sources and modalities with multi-modal deep learning autoencoders. The proposed design integrates the best features of six deep learning models: Multi-Modal Deep Learning Autoencoder (MMDLA), Anomaly Detection using Adaptive Metric Learning (ADAM), ADADELTA, ADAGRAD, RMSPROP, and Stacked Graph Transformer (SGT).
arXiv Detail & Related papers (2025-11-26T17:10:54Z)
Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond [116.65158801881984]
We introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. We develop a unified and interpretable FER foundation model termed UniFER-7B.
arXiv Detail & Related papers (2025-11-01T03:53:00Z)
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples of 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z)
MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning [4.963955559863751]
MMAT-1M is the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine. By fine-tuning open-source multimodal models on MMAT-1M, we observe significant performance gains.
arXiv Detail & Related papers (2025-07-29T15:39:14Z)
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs [52.79503055897109]
We present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation. We propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions.
arXiv Detail & Related papers (2025-04-11T08:46:49Z)
MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning [69.7347209018861]
We introduce MLLM-Selector, an automated approach that identifies valuable data for visual instruction tuning. We calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector.
arXiv Detail & Related papers (2025-03-26T12:42:37Z)
Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs [42.57007182613632]
We construct a benchmark to fairly evaluate over 30 different multimodal large language models (MLLMs). To our knowledge, this is the first visual corresponding dataset and benchmark for the MLLM community. CoLVA-InternVL2-4B achieves an overall accuracy (OA) of 49.80% on the MMVM benchmark, surpassing GPT-4o and the best open-source MLLM, Qwen2VL-72B, by 7.15% and 11.72% OA, respectively.
arXiv Detail & Related papers (2025-01-08T18:30:53Z)
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [81.95879920888716]
We introduce ShareGPT4V, a dataset featuring 1.2 million descriptive captions. This dataset surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM.
arXiv Detail & Related papers (2023-11-21T18:58:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.