MVIP -- A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part Recognition
- URL: http://arxiv.org/abs/2502.15448v1
- Date: Fri, 21 Feb 2025 13:22:29 GMT
- Title: MVIP -- A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part Recognition
- Authors: Paul Koch, Marian Schlüter, Jörg Krüger
- Abstract summary: MVIP is a novel dataset for multi-modal and multi-view application-oriented industrial part recognition. Our main goal with MVIP is to study and push the transferability of various state-of-the-art methods within related downstream tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MVIP, a novel dataset for multi-modal and multi-view application-oriented industrial part recognition. Here we are the first to combine a calibrated RGBD multi-view dataset with additional object context such as physical properties, natural language, and super-classes. The current portfolio of available datasets offers a wide range of representations to design and benchmark related methods. In contrast to existing classification challenges, industrial recognition applications offer controlled multi-modal environments, but at the same time face different problems than traditional 2D/3D classification challenges. Industrial applications must frequently deal with small or varying amounts of training data, visually similar parts, and varying object sizes, while requiring robust, near-100% top-5 accuracy under cost and time constraints. Current methods tackle such challenges individually, but direct adoption of these methods within industrial applications is complex and requires further research. Our main goal with MVIP is to study and push the transferability of various state-of-the-art methods within related downstream tasks towards an efficient deployment of industrial classifiers. Additionally, we intend to use MVIP to push research on several modality-fusion topics, (automated) synthetic data generation, and complex data sampling -- combined in a single application-oriented benchmark.
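The abstract's requirements can be made concrete with a small sketch: a hypothetical late-fusion classifier that mean-pools per-view embeddings, concatenates object-context features, and scores classes with a linear head, plus a top-k accuracy helper matching the "near-100% top-5 accuracy" criterion. The function names, dimensions, and fusion strategy are illustrative assumptions, not the methods evaluated on MVIP.

```python
import numpy as np

def top_k_accuracy(logits, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(logits, axis=1)[:, -k:]           # indices of the k largest logits per row
    hits = [label in row for label, row in zip(labels, topk)]
    return float(np.mean(hits))

def late_fusion(view_embeddings, context_embedding, weights):
    """Mean-pool per-view embeddings, append object context, apply a linear head."""
    pooled = np.mean(view_embeddings, axis=0)            # (d_view,) averaged over views
    fused = np.concatenate([pooled, context_embedding])  # (d_view + d_ctx,)
    return fused @ weights                               # (n_classes,) class logits

# Toy example: 4 camera views, 8-dim view features, 3-dim context, 10 classes.
rng = np.random.default_rng(0)
views = rng.normal(size=(4, 8))
context = rng.normal(size=3)
W = rng.normal(size=(11, 10))
logits = late_fusion(views, context, W)
```

Late fusion is only one of many possible fusion strategies; the point of the benchmark is precisely to compare such choices under industrial constraints.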
Related papers
- IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval [29.05476868272228]
Instance-Driven Multimodal Image Retrieval (IDMR) is a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario.
To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data.
Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench.
arXiv Detail & Related papers (2025-04-01T16:47:20Z) - A Multimodal Dataset for Enhancing Industrial Task Monitoring and Engagement Prediction [5.73110247142357]
We present a novel dataset that captures realistic assembly and disassembly tasks. The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video. Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments.
arXiv Detail & Related papers (2025-01-10T12:57:33Z) - SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy [2.294223504228228]
Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to construct more versatile and robust systems. Inspired by the human ability to assimilate information through many senses, this method enables applications such as text-to-video conversion, visual question answering, and image captioning. Recent developments in datasets that support multimodal language models (MLLMs) are highlighted in this overview.
arXiv Detail & Related papers (2024-12-23T18:15:19Z) - Semantic-Aware Representation of Multi-Modal Data for Data Ingress: A Literature Review [1.8590097948961688]
Generative AI such as Large Language Models (LLMs) sees broad adoption to process multi-modal data such as text, images, audio, and video.
Managing this data efficiently has become a significant practical challenge in industry; doubling the amount of data does not double its value.
This study focuses on the different semantic-aware techniques to extract embeddings from mono-modal, multi-modal, and cross-modal data.
arXiv Detail & Related papers (2024-07-17T09:49:11Z) - IPAD: Industrial Process Anomaly Detection Dataset [71.39058003212614]
Video anomaly detection (VAD) is a challenging task aiming to recognize anomalies in video frames.
We propose a new dataset, IPAD, specifically designed for VAD in industrial scenarios.
This dataset covers 16 different industrial devices and contains over 6 hours of both synthetic and real-world video footage.
arXiv Detail & Related papers (2024-04-23T13:38:01Z) - Striking Gold in Advertising: Standardization and Exploration of Ad Text Generation [5.3558730908641525]
We propose a first benchmark dataset, CAMERA, to standardize the task of ATG.
Our experiments show the current state and the remaining challenges.
We also explore how existing metrics in ATG and an LLM-based evaluator align with human evaluations.
arXiv Detail & Related papers (2023-09-21T12:51:24Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS).
We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST).
Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z) - Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
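The general idea behind entropy-based acquisition can be illustrated with a generic uncertainty-sampling sketch (an assumption for illustration, not the paper's exact acquisition function): score each unlabeled sample by the Shannon entropy of its predicted class distribution and label the most uncertain ones first.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of class probabilities."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_most_uncertain(probs, budget):
    """Return indices of the `budget` samples with the highest predictive entropy."""
    entropy = prediction_entropy(probs)
    return np.argsort(entropy)[::-1][:budget]

# Toy pool: three unlabeled samples with the model's class probabilities.
pool = np.array([
    [0.98, 0.01, 0.01],   # confident prediction -> low entropy
    [0.34, 0.33, 0.33],   # near-uniform -> high entropy
    [0.70, 0.20, 0.10],   # in between
])
picked = select_most_uncertain(pool, budget=1)  # selects the near-uniform sample
```

Such an acquisition function is cheap to compute from model outputs alone, which is what makes it readily adaptable to other multi-modal tasks.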
arXiv Detail & Related papers (2021-10-21T05:38:45Z) - Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE).
arXiv Detail & Related papers (2021-07-30T12:11:24Z) - MultiBench: Multiscale Benchmarks for Multimodal Representation Learning [87.23266008930045]
MultiBench is a systematic and unified benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
It provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation.
It introduces impactful challenges for future research, including robustness to large-scale multimodal datasets and robustness to realistic imperfections.
arXiv Detail & Related papers (2021-07-15T17:54:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences.