A Novel Multimodal RUL Framework for Remaining Useful Life Estimation with Layer-wise Explanations
- URL: http://arxiv.org/abs/2512.06708v1
- Date: Sun, 07 Dec 2025 07:38:36 GMT
- Title: A Novel Multimodal RUL Framework for Remaining Useful Life Estimation with Layer-wise Explanations
- Authors: Waleed Razzaq, Yun-Bo Zhao
- Abstract summary: Rolling-element bearings are among the most frequent causes of machinery failure. Existing approaches often suffer from poor generalization, lack of robustness, high data demands, and limited interpretability.
- Score: 2.312232949770907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating the Remaining Useful Life (RUL) of mechanical systems is pivotal in Prognostics and Health Management (PHM). Rolling-element bearings are among the most frequent causes of machinery failure, highlighting the need for robust RUL estimation methods. Existing approaches often suffer from poor generalization, lack of robustness, high data demands, and limited interpretability. This paper proposes a novel multimodal-RUL framework that jointly leverages image representations (ImR) and time-frequency representations (TFR) of multichannel, nonstationary vibration signals. The architecture comprises three branches: (1) an ImR branch and (2) a TFR branch, both employing multiple dilated convolutional blocks with residual connections to extract spatial degradation features; and (3) a fusion branch that concatenates these features and feeds them into an LSTM to model temporal degradation patterns. A multi-head attention mechanism subsequently emphasizes salient features, followed by linear layers for final RUL regression. To enable effective multimodal learning, vibration signals are converted into ImR via the Bresenham line algorithm and into TFR using Continuous Wavelet Transform. We also introduce multimodal Layer-wise Relevance Propagation (multimodal-LRP), a tailored explainability technique that significantly enhances model transparency. The approach is validated on the XJTU-SY and PRONOSTIA benchmark datasets. Results show that our method matches or surpasses state-of-the-art baselines under both seen and unseen operating conditions, while requiring ~28 % less training data on XJTU-SY and ~48 % less on PRONOSTIA. The model exhibits strong noise resilience, and multimodal-LRP visualizations confirm the interpretability and trustworthiness of predictions, making the framework highly suitable for real-world industrial deployment.
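The abstract outlines the full pipeline: two convolutional branches (one per modality) with dilated residual blocks, feature concatenation, an LSTM for temporal degradation patterns, multi-head attention, and a linear regression head. The following PyTorch sketch illustrates that structure only; channel widths, kernel sizes, dilation rates, hidden sizes, the last-step readout, and the input shapes are assumptions for illustration, not values taken from the paper. The Bresenham-based image representations (ImR) and CWT-based time-frequency representations (TFR) are treated as precomputed inputs, and multimodal-LRP is omitted.

```python
# Minimal sketch of the multimodal RUL architecture described in the abstract.
# All layer sizes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class DilatedResBlock(nn.Module):
    """2D dilated convolution block with a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.conv(x)


class Branch(nn.Module):
    """Spatial feature extractor; the ImR and TFR branches share this structure."""
    def __init__(self, in_ch, width=32, dilations=(1, 2, 4)):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, width, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[DilatedResBlock(width, d) for d in dilations])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                                   # x: (B*T, C, H, W)
        return self.pool(self.blocks(self.stem(x))).flatten(1)  # (B*T, width)


class MultimodalRUL(nn.Module):
    def __init__(self, imr_ch=1, tfr_ch=1, width=32, hidden=64, heads=4):
        super().__init__()
        self.imr_branch = Branch(imr_ch, width)
        self.tfr_branch = Branch(tfr_ch, width)
        self.lstm = nn.LSTM(2 * width, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden // 2), nn.ReLU(),
                                  nn.Linear(hidden // 2, 1))

    def forward(self, imr, tfr):
        # imr, tfr: (B, T, C, H, W) sequences of precomputed ImR / TFR inputs
        B, T = imr.shape[:2]
        f_imr = self.imr_branch(imr.flatten(0, 1)).view(B, T, -1)
        f_tfr = self.tfr_branch(tfr.flatten(0, 1)).view(B, T, -1)
        fused = torch.cat([f_imr, f_tfr], dim=-1)    # concatenate modality features
        seq, _ = self.lstm(fused)                    # temporal degradation modelling
        attended, _ = self.attn(seq, seq, seq)       # emphasize salient time steps
        return self.head(attended[:, -1])            # RUL estimate from last step


# Smoke test with random tensors (shapes are illustrative only).
model = MultimodalRUL()
rul = model(torch.randn(2, 8, 1, 64, 64), torch.randn(2, 8, 1, 64, 64))
print(rul.shape)  # torch.Size([2, 1])
```

In practice the ImR inputs would come from rasterizing the vibration channels with the Bresenham line algorithm and the TFR inputs from a Continuous Wavelet Transform of the same signals; both conversions are assumed to happen upstream of this module.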
Related papers
- FITMM: Adaptive Frequency-Aware Multimodal Recommendation via Information-Theoretic Representation Learning [14.873780184982003]
We propose a Frequency-aware Information-Theoretic framework for multimodal recommendation. FITMM constructs graph-enhanced item representations, performs modality-wise spectral decomposition, and forms lightweight within-band multimodal components. Experiments on three real-world datasets demonstrate that FITMM consistently and significantly outperforms advanced baselines.
arXiv Detail & Related papers (2026-01-30T03:16:54Z) - Frequency Error-Guided Under-sampling Optimization for Multi-Contrast MRI Reconstruction [24.246450246745905]
Multi-contrast MRI reconstruction has emerged as a promising direction by leveraging complementary information from fully-sampled reference scans. Existing approaches suffer from three major limitations: (1) superficial reference fusion strategies, (2) insufficient utilization of the complementary information provided by the reference contrast, and (3) fixed under-sampling patterns. We propose an efficient and interpretable frequency error-guided reconstruction framework to tackle these issues.
arXiv Detail & Related papers (2026-01-14T09:40:34Z) - Multi-modal Loop Closure Detection with Foundation Models in Severely Unstructured Environments [10.028232479762075]
This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure. Experiments on the S3LI dataset and S3LI Vulcano dataset show that MPRF outperforms state-of-the-art retrieval methods in precision. By providing interpretable correspondences suitable for SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability.
arXiv Detail & Related papers (2025-11-07T16:30:35Z) - LUMA-RAG: Lifelong Multimodal Agents with Provably Stable Streaming Alignment [0.0]
Retrieval-Augmented Generation has emerged as the dominant paradigm for grounding large language model outputs in verifiable evidence. We present LUMA-RAG, a lifelong multimodal agent architecture featuring three key innovations. Experiments demonstrate robust text-to-image retrieval (Recall@10 = 0.94), graceful performance degradation under product quantization offloading, and provably stable audio-to-image rankings (Safe@1 = 1.0).
arXiv Detail & Related papers (2025-11-04T08:47:12Z) - FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [57.577843653775]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z) - FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models. We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components. FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z) - Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation [61.64052577026623]
Real-world multi-view datasets are often heterogeneous and imperfect. We propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Our RML is self-supervised and can also be applied for downstream tasks as a regularization.
arXiv Detail & Related papers (2025-03-06T07:01:08Z) - Diffusion Augmented Retrieval: A Training-Free Approach to Interactive Text-to-Image Retrieval [7.439049772394586]
Diffusion Augmented Retrieval (DAR) is a framework that generates multiple intermediate representations via dialogue refinements and DMs. DAR performs on par with finetuned I-TIR models, yet without incurring their tuning overhead.
arXiv Detail & Related papers (2025-01-26T03:29:18Z) - Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection [70.84835546732738]
RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures may not have adequately considered the robustness against noise originating from defective modalities. We propose ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy.
arXiv Detail & Related papers (2024-12-02T14:44:39Z) - Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity thanks to the self-attention mechanism, despite its high computational cost.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frame correspondences are the key sources of temporal and spatial information.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)