SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
- URL: http://arxiv.org/abs/2408.12114v3
- Date: Fri, 11 Oct 2024 06:25:42 GMT
- Title: SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
- Authors: Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee, Yong Man Ro
- Abstract summary: This paper aims to establish a multi-vision Sensor Perception And Reasoning benchmarK called SPARK.
We generated 6,248 vision-language test samples to investigate multi-vision sensory perception and multi-vision sensory reasoning over physical sensor knowledge.
Results showed that most models displayed deficiencies in multi-vision sensory reasoning to varying extents.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale Vision-Language Models (LVLMs) have made remarkable progress in computer vision tasks by aligning the text modality with vision inputs. There are also efforts to incorporate multi-vision sensors beyond RGB, including thermal, depth, and medical X-ray images. However, we observe that current LVLMs treat images taken from multi-vision sensors as if they were in the same RGB domain, without considering the physical characteristics of those sensors. They fail to properly convey the fundamental sensor information in the data and the corresponding contextual knowledge. Consequently, alignment between the information from the actual physical environment and the text is not achieved correctly, making it difficult to answer complex sensor-related questions that consider the physical environment. In this paper, we aim to establish a multi-vision Sensor Perception And Reasoning benchmarK called SPARK that can reduce the fundamental multi-vision sensor information gap between images and multi-vision sensors. We generated 6,248 vision-language test samples to investigate multi-vision sensory perception and multi-vision sensory reasoning over physical sensor knowledge, across different question formats and types of sensor-related questions. We used these samples to assess ten leading LVLMs. The results showed that most models displayed deficiencies in multi-vision sensory reasoning to varying extents. Code and data are available at https://github.com/top-yun/SPARK
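Since the code and data are public, a small evaluation harness is easy to script. Below is a minimal sketch of such a harness; the JSON field names (`sensor_type`, `task`, `image_path`, `question`, `answer`) and the `query_model` stub are hypothetical placeholders, not the benchmark's actual schema or API.

```python
import json
from collections import defaultdict

def query_model(image_path: str, prompt: str) -> str:
    """Stub for a call to the LVLM under test; replace with a real client."""
    raise NotImplementedError

def evaluate(samples_path: str) -> dict:
    # Hypothetical schema: one JSON record per vision-language test sample.
    with open(samples_path) as f:
        samples = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        # Report accuracy per sensor type (e.g., RGB, thermal, depth, X-ray)
        # and per task (perception vs. reasoning), as the benchmark splits do.
        key = (s["sensor_type"], s["task"])
        answer = query_model(s["image_path"], s["question"])
        total[key] += 1
        if answer.strip().lower() == s["answer"].strip().lower():
            correct[key] += 1
    return {k: correct[k] / total[k] for k in total}
```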
Related papers
- MSSIDD: A Benchmark for Multi-Sensor Denoising
We introduce a new benchmark, the Multi-Sensor SIDD dataset, which is the first raw-domain dataset designed to evaluate the sensor transferability of denoising models.
We propose a sensor consistency training framework that enables denoising models to learn the sensor-invariant features.
arXiv Detail & Related papers (2024-11-18T13:32:59Z)
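As a rough illustration of the sensor-consistency idea in MSSIDD above: penalize a denoiser whose intermediate features differ when the same scene is observed through two different sensors. The `return_features` interface and the loss weighting below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sensor_consistency_loss(model, noisy_a, noisy_b, clean):
    """Denoise one scene captured through two sensors and pull their
    intermediate features together (sensor-invariant features).
    Illustrative sketch only; not MSSIDD's exact objective."""
    out_a, feat_a = model(noisy_a, return_features=True)  # assumed interface
    out_b, feat_b = model(noisy_b, return_features=True)
    recon = F.l1_loss(out_a, clean) + F.l1_loss(out_b, clean)
    consistency = F.mse_loss(feat_a, feat_b)
    return recon + 0.1 * consistency  # the weight is a free hyperparameter
```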
- SensorLLM: Aligning Large Language Models with Motion Sensors for Human Activity Recognition
We bridge the gap between wearable sensor technology and personalized AI assistants by enabling Large Language Models (LLMs) to understand time-series tasks like human activity recognition (HAR).
We introduce SensorLLM, a two-stage framework to unlock LLMs' potential for sensor data tasks.
We show that SensorLLM evolves into an effective sensor learner, reasoner, and classifier, enabling it to generalize across diverse datasets for HAR tasks.
arXiv Detail & Related papers (2024-10-14T15:30:41Z)
- By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting
We propose a visual prompting approach for sensor data using multimodal large language models (MLLMs).
We design a visual prompt that directs MLLMs to utilize visualized sensor data alongside the target sensory task descriptions.
We evaluate our approach on nine sensory tasks involving four sensing modalities, achieving an average of 10% higher accuracy than text-based prompts.
arXiv Detail & Related papers (2024-07-15T01:33:54Z)
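A minimal sketch of the visual-prompting idea from By My Eyes above: render a sensor time series as a plot and hand the image to an image-capable model together with the task description. The plotting choices and the base64 packaging are illustrative assumptions, not the paper's prescribed format.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np

def sensor_plot_as_base64(signal: np.ndarray, rate_hz: float) -> str:
    """Render a 1-D sensor time series as a PNG, base64-encoded for an
    image-capable chat API. The plot itself is the 'visual prompt'."""
    t = np.arange(len(signal)) / rate_hz
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.plot(t, signal)
    ax.set_xlabel("time [s]")
    ax.set_ylabel("acceleration [m/s^2]")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=120)
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode()

# The task description accompanies the rendered signal, as in the paper's setup.
prompt = ("The image shows 3 seconds of wrist accelerometer data. "
          "Which activity is most likely: walking, sitting, or running?")
image_b64 = sensor_plot_as_base64(np.random.randn(150), rate_hz=50.0)
```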
- On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks
Training on inaccurate or corrupt data induces model bias and hampers generalisation capabilities.
This paper investigates the effect of sensor errors on the dense 3D vision tasks of depth estimation and reconstruction.
arXiv Detail & Related papers (2023-03-26T22:32:44Z)
- ViViD++: Vision for Visibility Dataset
We present a dataset capturing diverse visual data formats that target varying luminance conditions.
Despite the potential of alternative sensors, datasets featuring them remain scarce.
We provide these measurements along with inertial sensor data and ground truth for developing robust visual SLAM under poor illumination.
arXiv Detail & Related papers (2022-04-13T06:01:27Z)
- Learning Enriched Illuminants for Cross and Single Sensor Color Constancy
We propose a cross-sensor self-supervised training scheme for the network.
The network is trained by randomly sampling artificial illuminants in a sensor-independent manner.
Experiments show that our cross-sensor model and single-sensor model outperform other state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-03-21T15:45:35Z)
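One plausible reading of the illuminant-sampling step in the color constancy paper above, sketched below: draw a random, sensor-independent RGB illuminant and apply it to a raw image, with the sampled illuminant serving as the training target. The value ranges and normalization are assumptions.

```python
import numpy as np

def sample_illuminant(rng: np.random.Generator) -> np.ndarray:
    """Draw a random RGB illuminant with unit-norm chromaticity.
    Sensor-independent: the draw does not condition on any camera."""
    rgb = rng.uniform(0.2, 1.0, size=3)
    return rgb / np.linalg.norm(rgb)

def relight(raw_image: np.ndarray, illuminant: np.ndarray) -> np.ndarray:
    """Apply a synthetic illuminant to a white-balanced raw image (H, W, 3).
    The training target for the network is the illuminant itself."""
    return np.clip(raw_image * illuminant[None, None, :], 0.0, 1.0)

rng = np.random.default_rng(0)
ill = sample_illuminant(rng)
relit = relight(np.full((4, 4, 3), 0.5, dtype=np.float32), ill)
```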
- Analyzing General-Purpose Deep-Learning Detection and Segmentation Models with Images from a Lidar as a Camera Sensor
This work explores the potential of general-purpose DL perception algorithms for processing image-like outputs of advanced lidar sensors.
Rather than processing three-dimensional point cloud data, this is, to the best of our knowledge, the first work to focus on low-resolution images with a 360° field of view.
We show that with adequate preprocessing, general-purpose DL models can process these images, opening the door to their usage in environmental conditions where standard visual sensors present inherent limitations.
arXiv Detail & Related papers (2022-03-08T13:14:43Z)
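A sketch of the kind of preprocessing the lidar-as-camera paper above relies on: converting a 16-bit, single-channel lidar signal/range panorama into the 8-bit, three-channel input expected by off-the-shelf detectors. The percentile contrast stretch is an assumption, not the paper's documented pipeline.

```python
import numpy as np

def lidar_image_to_rgb(raw: np.ndarray) -> np.ndarray:
    """Convert a 16-bit single-channel lidar signal/range image into an
    8-bit 3-channel image suitable for pretrained detection models.
    Illustrative preprocessing; the paper's exact pipeline may differ."""
    x = raw.astype(np.float32)
    lo, hi = np.percentile(x, (1, 99))        # robust contrast stretch
    x = np.clip((x - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    img8 = (x * 255).astype(np.uint8)
    return np.stack([img8] * 3, axis=-1)      # replicate to fake RGB

panorama = lidar_image_to_rgb(
    np.random.randint(0, 65535, (128, 2048), dtype=np.uint16))
```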
- Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos).
The SAKDN uses multiple wearable sensors as teacher modalities and RGB videos as the student modality.
arXiv Detail & Related papers (2020-09-01T03:38:31Z)
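For orientation, here is a textbook knowledge-distillation step in the sensor-to-vision setting SAKDN addresses: a wearable-sensor teacher's softened predictions guide an RGB-video student. This is generic KD, not SAKDN's full semantics-aware adaptive objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Soften the wearable-sensor teacher's predictions and pull the video
    student toward them, blended with the usual supervised loss.
    Standard KD sketch; SAKDN's actual objective is richer."""
    soft_t = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_s = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_s, soft_t, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```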
- Learning Camera Miscalibration Detection
This paper focuses on a data-driven approach to learn the detection of miscalibration in vision sensors, specifically RGB cameras.
Our contributions include a proposed miscalibration metric for RGB cameras and a novel semi-synthetic dataset generation pipeline based on this metric.
By training a deep convolutional neural network, we demonstrate the effectiveness of our pipeline in identifying whether recalibration of the camera's intrinsic parameters is required.
arXiv Detail & Related papers (2020-05-24T10:32:49Z)
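As a hedged illustration of the kind of miscalibration signal this last paper builds on: the mean reprojection error of known 3D points under the camera's assumed intrinsics, where a large value suggests recalibration is needed. The paper defines its own metric; this is a generic stand-in using OpenCV.

```python
import numpy as np
import cv2

def mean_reprojection_error(object_pts, image_pts, K, dist, rvec, tvec):
    """Average pixel distance between observed 2D points (N, 2) and known
    3D points (N, 3) reprojected with the assumed intrinsics K and
    distortion coefficients. Illustrative stand-in for a miscalibration
    metric; not the metric proposed in the paper."""
    proj, _ = cv2.projectPoints(object_pts, rvec, tvec, K, dist)
    return float(np.mean(np.linalg.norm(proj.squeeze(1) - image_pts, axis=1)))
```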