MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects
- URL: http://arxiv.org/abs/2412.04867v1
- Date: Fri, 06 Dec 2024 09:01:10 GMT
- Title: MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects
- Authors: Lei Fan, Dongdong Fan, Zhiguang Hu, Yiwen Ding, Donglin Di, Kai Yi, Maurice Pagnucco, Yang Song
- Abstract summary: We present MANTA, a visual-text anomaly detection dataset for tiny objects. The visual component comprises over 137.3K images across 38 object categories spanning five typical domains. The text component consists of two subsets: Declarative Knowledge, including 875 words that describe common anomalies; and Constructivist Learning, providing 2K multiple-choice questions with varying levels of difficulty.
- Score: 18.711657127220665
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present MANTA, a visual-text anomaly detection dataset for tiny objects. The visual component comprises over 137.3K images across 38 object categories spanning five typical domains, of which 8.6K images are labeled as anomalous with pixel-level annotations. Each image is captured from five distinct viewpoints to ensure comprehensive object coverage. The text component consists of two subsets: Declarative Knowledge, including 875 words that describe common anomalies across various domains and specific categories, with detailed explanations for <what, why, how>, including causes and visual characteristics; and Constructivist Learning, providing 2K multiple-choice questions with varying levels of difficulty, each paired with images and corresponding answer explanations. We also propose a baseline for visual-text tasks and conduct extensive benchmarking experiments to evaluate advanced methods across different settings, highlighting the challenges and efficacy of our dataset.
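To make the dataset's composition concrete, the sketch below models one possible record layout for MANTA's visual and text components as plain Python dataclasses. This is a minimal illustration under stated assumptions only: the class names, field names, and example values (the "capacitor" category and file paths) are hypothetical and do not reflect MANTA's published schema or loading API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical record layout for one MANTA object instance.
# Field names are illustrative assumptions, not the dataset's official schema.
@dataclass
class MantaSample:
    category: str                # one of the 38 object categories
    domain: str                  # one of the five typical domains
    view_paths: List[str]        # five images, one per viewpoint
    # Pixel-level anomaly masks, one per view; None for normal views.
    mask_paths: List[Optional[str]] = field(default_factory=lambda: [None] * 5)
    is_anomalous: bool = False

# Hypothetical entry in the Declarative Knowledge subset (875 anomaly words).
@dataclass
class DeclarativeEntry:
    word: str          # anomaly term
    what: str          # visual characteristics
    why: str           # likely causes
    how: str           # how the anomaly manifests

# Hypothetical item in the Constructivist Learning subset (2K questions).
@dataclass
class MCQItem:
    question: str
    image_paths: List[str]
    choices: List[str]
    answer_index: int
    explanation: str   # answer explanation paired with the question
    difficulty: str    # e.g. "easy" / "medium" / "hard"

# Example usage with placeholder values.
sample = MantaSample(
    category="capacitor",
    domain="electronics",
    view_paths=[f"views/capacitor_0001_v{i}.png" for i in range(5)],
)
print(sample.category, len(sample.view_paths))
```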
Related papers
- Composed Object Retrieval: Object-level Retrieval via Composed Expressions [71.47650333199628]
Composed Object Retrieval (COR) is a brand-new task that goes beyond image-level retrieval to achieve object-level precision. We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning.
arXiv Detail & Related papers (2025-08-06T13:11:40Z) - MARS: Paying more attention to visual attributes for text-based person search [6.438244172631555]
This paper presents a novel text-based person search (TBPS) architecture named MARS (Mae-Attribute-Relation-Sensitive).
It enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss.
Experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements.
arXiv Detail & Related papers (2024-07-05T06:44:43Z) - DOCCI: Descriptions of Connected and Contrasting Images [58.377060316967864]
Descriptions of Connected and Contrasting Images (DOCCI) is a dataset with long, human-annotated English descriptions for 15k images.
We instruct human annotators to create comprehensive descriptions for each image.
We show that DOCCI is a useful testbed for text-to-image generation.
arXiv Detail & Related papers (2024-04-30T17:56:24Z) - Panoptic Perception: A Novel Task and Fine-grained Dataset for Universal Remote Sensing Image Interpretation [19.987706084203523]
We propose Panoptic Perception, a novel task and a new fine-grained dataset (FineGrip) to achieve a more thorough and universal interpretation for RSIs.
The new task integrates pixel-level, instance-level, and image-level information for universal image perception.
The FineGrip dataset includes 2,649 remote sensing images, 12,054 fine-grained instance segmentation masks belonging to 20 foreground thing categories, 7,599 background semantic masks for 5 stuff classes, and 13,245 captioning sentences.
arXiv Detail & Related papers (2024-04-06T12:27:21Z) - V3Det: Vast Vocabulary Visual Detection Dataset [69.50942928928052]
V3Det is a vast vocabulary visual detection dataset with precisely annotated bounding boxes on massive images.
By offering a vast exploration space, V3Det enables extensive benchmarks on both vast and open vocabulary object detection.
arXiv Detail & Related papers (2023-04-07T17:45:35Z) - Salient Object Detection for Images Taken by People With Vision Impairments [13.157939981657886]
We introduce a new salient object detection dataset using images taken by people who are visually impaired.
VizWiz-SalientObject is the largest such dataset (32,000 human-annotated images) and contains unique characteristics.
We benchmarked seven modern salient object detection methods on our dataset and found they struggle most with images featuring salient objects that are large, have less complex boundaries, and lack text.
arXiv Detail & Related papers (2023-01-12T22:33:01Z) - Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark [80.79082788458602]
We provide a new multi-task benchmark for evaluating text-to-image models.
We compare the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models.
Twenty computer science AI graduate students evaluated the two models, on three tasks, at three difficulty levels, across ten prompts each.
arXiv Detail & Related papers (2022-11-22T09:27:53Z) - VizWiz-FewShot: Locating Objects in Images Taken by People With Visual Impairments [74.72656607288185]
We introduce a few-shot localization dataset originating from photographers who were authentically trying to learn about the visual content in the images they took.
It includes nearly 10,000 segmentations of 100 categories in over 4,500 images that were taken by people with visual impairments.
Compared to existing few-shot object detection and instance segmentation datasets, our dataset is the first to locate holes in objects.
arXiv Detail & Related papers (2022-07-24T20:44:51Z) - ACDC: The Adverse Conditions Dataset with Correspondences for Robust Semantic Driving Scene Perception [86.03633244019954]
Level-5 driving automation requires a robust visual perception system that can parse input images under any condition.
We introduce ACDC, the Adverse Conditions dataset for training and testing methods for diverse semantic perception tasks on adverse visual conditions.
A detailed empirical study demonstrates the challenges that the adverse domains of ACDC pose to state-of-the-art supervised and unsupervised approaches.
arXiv Detail & Related papers (2021-04-27T18:00:05Z) - Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z) - Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)