Audio-visual cross-modality knowledge transfer for machine learning-based in-situ monitoring in laser additive manufacturing
- URL: http://arxiv.org/abs/2408.05307v4
- Date: Sat, 08 Feb 2025 19:28:57 GMT
- Title: Audio-visual cross-modality knowledge transfer for machine learning-based in-situ monitoring in laser additive manufacturing
- Authors: Jiarui Xie, Mutahar Safdar, Lequn Chen, Seung Ki Moon, Yaoyao Fiona Zhao,
- Abstract summary: This paper introduces a cross-modality knowledge transfer (CMKT) methodology for LAM in-situ monitoring.
Three CMKT methods are proposed: semantic alignment, fully supervised mapping, and semi-supervised mapping.
In a case study for LAM in-situ defect detection, the proposed CMKT methods were compared with multimodal audio-visual fusion.
- Score: 2.592307869002029
- License:
- Abstract: Various machine learning (ML)-based in-situ monitoring systems have been developed to detect anomalies and defects in laser additive manufacturing (LAM) processes. While multimodal fusion, which integrates data from visual, audio, and other modalities, can improve monitoring performance, it also increases hardware, computational, and operational costs. This paper introduces a cross-modality knowledge transfer (CMKT) methodology for LAM in-situ monitoring, which transfers knowledge from a source modality to a target modality. CMKT enhances the representativeness of the features extracted from the target modality, allowing the removal of source modality sensors during prediction. This paper proposes three CMKT methods: semantic alignment, fully supervised mapping, and semi-supervised mapping. The semantic alignment method establishes a shared encoded space between modalities to facilitate knowledge transfer. It employs a semantic alignment loss to align the distributions of identical groups (e.g., visual and audio defective groups) and a separation loss to distinguish different groups (e.g., visual defective and audio defect-free groups). The two mapping methods transfer knowledge by deriving features from one modality to another using fully supervised and semi-supervised learning approaches. In a case study for LAM in-situ defect detection, the proposed CMKT methods were compared with multimodal audio-visual fusion. The semantic alignment method achieved an accuracy of 98.6% while removing the audio modality during the prediction phase, which is comparable to the 98.2% accuracy obtained through multimodal fusion. Using explainable artificial intelligence, we discovered that semantic alignment CMKT can extract more representative features while reducing noise by leveraging the inherent correlations between modalities.
Related papers
- SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection [18.090706979440334]
Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors.
Current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at various depth layers of network.
In this paper, we introduce an accurate and efficient object detection method named SeaDATE.
arXiv Detail & Related papers (2024-10-15T07:26:39Z) - Disentangled Noisy Correspondence Learning [56.06801962154915]
Cross-modal retrieval is crucial in understanding latent correspondences across modalities.
DisNCL is a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning.
arXiv Detail & Related papers (2024-08-10T09:49:55Z) - Cross-Modal Learning for Anomaly Detection in Complex Industrial Process: Methodology and Benchmark [19.376814754500625]
Anomaly detection in complex industrial processes plays a pivotal role in ensuring efficient, stable, and secure operation.
This paper proposes a cross-modal Transformer to facilitate anomaly detection by exploring the correlation between visual features (video) and process variables (current) in the context of the fused magnesium smelting process.
We present a pioneering cross-modal benchmark of the fused magnesium smelting process, featuring synchronously acquired video and current data for over 2.2 million samples.
arXiv Detail & Related papers (2024-06-13T11:40:06Z) - What to Remember: Self-Adaptive Continual Learning for Audio Deepfake
Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types.
We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z) - Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised
Audio-Visual Video Parsing [107.031903351176]
Weakly-separated audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances.
WS-AVVP aims to identify the corresponding event categories with only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z) - Task-Oriented Sensing, Computation, and Communication Integration for
Multi-Device Edge AI [108.08079323459822]
This paper studies a new multi-intelligent edge artificial-latency (AI) system, which jointly exploits the AI model split inference and integrated sensing and communication (ISAC)
We measure the inference accuracy by adopting an approximate but tractable metric, namely discriminant gain.
arXiv Detail & Related papers (2022-07-03T06:57:07Z) - Mitigating the Mutual Error Amplification for Semi-Supervised Object
Detection [92.52505195585925]
We propose a Cross Teaching (CT) method, aiming to mitigate the mutual error amplification by introducing a rectification mechanism of pseudo labels.
In contrast to existing mutual teaching methods that directly treat predictions from other detectors as pseudo labels, we propose the Label Rectification Module (LRM)
arXiv Detail & Related papers (2022-01-26T03:34:57Z) - Multi-Modal Perception Attention Network with Self-Supervised Learning
for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z) - Statistical control for spatio-temporal MEG/EEG source imaging with
desparsified multi-task Lasso [102.84915019938413]
Non-invasive techniques like magnetoencephalography (MEG) or electroencephalography (EEG) offer promise of non-invasive techniques.
The problem of source localization, or source imaging, poses however a high-dimensional statistical inference challenge.
We propose an ensemble of desparsified multi-task Lasso (ecd-MTLasso) to deal with this problem.
arXiv Detail & Related papers (2020-09-29T21:17:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.