Related papers: Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark

Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark

URL: http://arxiv.org/abs/2410.01737v2
Date: Mon, 27 Oct 2025 08:33:50 GMT
Title: Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark
Authors: Bingchen Miao, Wenqiao Zhang, Juncheng Li, Wangyu Wu, Siliang Tang, Zhaocheng Li, Haochen Shi, Jun Xiao, Yueting Zhuang,
Abstract summary: We introduce a pioneering study that investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD)<n>We find that most existing MIAD methods perform poorly on the MIIAD Bench, leading to significant performance degradation.<n>We propose a novel two-stage Robust modAlity-aware fusing and Detecting framewoRk, abbreviated as RADAR.
Score: 69.02666229531322
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Industrial Anomaly Detection (MIAD), which utilizes 3D point clouds and 2D RGB images to identify abnormal regions in products, plays a crucial role in industrial quality inspection. However, traditional MIAD settings assume that all 2D and 3D modalities are paired, ignoring the fact that multimodal data collected from the real world is often imperfect due to missing modalities. Additionally, models trained on modality-incomplete data are prone to overfitting. Therefore, MIAD models that demonstrate robustness against modality-incomplete data are highly desirable in practice. To address this, we introduce a pioneering study that comprehensively investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD), and under the guidance of experts, we construct the MIIAD Bench with rich modality-missing settings to account for imperfect learning environments with incomplete multimodal information. As expected, we find that most existing MIAD methods perform poorly on the MIIAD Bench, leading to significant performance degradation. To tackle this challenge, we propose a novel two-stage Robust modAlity-aware fusing and Detecting framewoRk, abbreviated as RADAR. Specifically: i) We propose Modality-incomplete Instruction to guide the multimodal Transformer to robustly adapt to various modality-incomplete scenarios, and implement adaptive parameter learning based on HyperNetwork. ii) Then, we construct a Double-Pseudo Hybrid Module to highlight the uniqueness of modality combinations, mitigating overfitting issues and further enhancing the robustness of the MIIAD model. Our experimental results demonstrate that the proposed RADAR significantly outperforms traditional MIAD methods on our newly created MIIAD dataset, proving its practical application value.

Related papers

MissMAC-Bench: Building Solid Benchmark for Missing Modality Issue in Robust Multimodal Affective Computing [21.70459049925545]
MissMAC-Bench is a comprehensive benchmark designed to establish fair and unified evaluation standards.<n>Two guiding principles are proposed, including no missing prior during training.<n>Our benchmark integrates evaluation protocols with both fixed and random missing patterns at the dataset and instance levels.
arXiv Detail & Related papers (2026-01-31T16:39:34Z)
Feature Fusion and Knowledge-Distilled Multi-Modal Multi-Target Detection [2.295863158976069]
We propose a feature fusion and knowledge-distilled framework for multi-modal MTD.<n>We formulate the problem as a posterior probability optimization task, which is solved through a multi-stage training pipeline.<n> Experimental results demonstrate that our student model achieves approximately 95% of the teacher model's mean Average Precision.
arXiv Detail & Related papers (2025-05-31T03:11:44Z)
Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection [53.2590751089607]
Real-IAD D3 is a high-precision multimodal dataset that incorporates an additional pseudo3D modality generated through photometric stereo. We introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance.
arXiv Detail & Related papers (2025-04-19T08:05:47Z)
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness [61.87055159919641]
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities.<n>Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality.<n>We introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM)
arXiv Detail & Related papers (2025-03-24T08:46:52Z)
Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning [18.268054258939213]
We introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task. We construct an novel framework called M$2$D-LIF, which consists of the Mono-Modality Distillation (M$2$D) method and the Local Illumination-aware Fusion (LIF) module. Our M$2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors.
arXiv Detail & Related papers (2025-03-14T18:15:53Z)
Modality-Invariant Bidirectional Temporal Representation Distillation Network for Missing Multimodal Sentiment Analysis [6.15602203132432]
We introduce the Modality-Invariant Bidirectional Temporal Representation Distillation Network (MITR-DNet) for Missing Multimodal Sentiment Analysis. MITR-DNet employs a distillation approach, wherein a complete modality teacher model guides a missing modality student model, ensuring robustness in the presence of modality missing.
arXiv Detail & Related papers (2025-01-07T07:57:16Z)
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [79.58755811919366]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing.<n>It is designed to accurately detect horizontal or oriented objects from any sensor modality.<n>This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection [23.91870504363899]
Double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data. This has hindered the widespread employment of multispectral pedestrian detection in embedded devices for autonomous systems. We introduce the Adaptive Modal Fusion Distillation (AMFD) framework, which can fully utilize the original modal features of the teacher network.
arXiv Detail & Related papers (2024-05-21T17:17:17Z)
MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications. Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders. We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
CANAMRF: An Attention-Based Model for Multimodal Depression Detection [7.266707571724883]
We present a Cross-modal Attention Network with Adaptive Multi-modal Recurrent Fusion (CANAMRF) for multimodal depression detection. CANAMRF is constructed by a multimodal feature extractor, an Adaptive Multimodal Recurrent Fusion module, and a Hybrid Attention Module.
arXiv Detail & Related papers (2024-01-04T12:08:16Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
Generative-based Fusion Mechanism for Multi-Modal Tracking [35.77340348091937]
We introduce Conditional Generative Adversarial Networks (CGANs) and Diffusion Models (DMs) We condition these multi-modal features with random noise in the GM framework, effectively transforming the original training samples into harder instances. This design excels at extracting discriminative clues from the features, enhancing the ultimate tracking performance.
arXiv Detail & Related papers (2023-09-04T17:22:10Z)
Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model. HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
multimodal misinformation is a growing problem on social media platforms. In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks. We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)
Multimodal Industrial Anomaly Detection via Hybrid Fusion [59.16333340582885]
We propose a novel multimodal anomaly detection method with hybrid fusion scheme. Our model outperforms the state-of-the-art (SOTA) methods on both detection and segmentation precision on MVTecD-3 AD dataset.
arXiv Detail & Related papers (2023-03-01T15:48:27Z)
Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN) We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis [47.29528724322795]
Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently. Despite significant progress, there are still two major challenges on the way towards robust MSA. We propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR)
arXiv Detail & Related papers (2022-08-16T08:02:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.