JEMA: A Joint Embedding Framework for Scalable Co-Learning with Multimodal Alignment
- URL: http://arxiv.org/abs/2410.23988v1
- Date: Thu, 31 Oct 2024 14:42:26 GMT
- Title: JEMA: A Joint Embedding Framework for Scalable Co-Learning with Multimodal Alignment
- Authors: Joao Sousa, Roya Darabi, Armando Sousa, Frank Brueckner, Luís Paulo Reis, Ana Reis
- Abstract summary: JEMA (Joint Embedding with Multimodal Alignment) is a novel co-learning framework tailored for laser metal deposition (LMD).
We report an 8% increase in performance in multimodal settings and a 1% improvement in unimodal settings compared to supervised contrastive learning.
Our framework lays the foundation for integrating multisensor data with metadata, enabling diverse downstream tasks within the LMD domain and beyond.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work introduces JEMA (Joint Embedding with Multimodal Alignment), a novel co-learning framework tailored for laser metal deposition (LMD), a pivotal process in metal additive manufacturing. As Industry 5.0 gains traction in industrial applications, efficient process monitoring becomes increasingly crucial. However, limited data and the opaque nature of AI present challenges for its application in an industrial setting. JEMA addresses these challenges by leveraging multimodal data, including multi-view images and metadata such as process parameters, to learn transferable semantic representations. By applying a supervised contrastive loss function, JEMA enables robust learning and subsequent process monitoring using only the primary modality, simplifying hardware requirements and reducing computational overhead. We investigate the effectiveness of JEMA in LMD process monitoring, focusing specifically on its generalization to downstream tasks such as melt pool geometry prediction, achieved without extensive fine-tuning. Our empirical evaluation demonstrates the high scalability and performance of JEMA, particularly when combined with Vision Transformer models. We report an 8% increase in performance in multimodal settings and a 1% improvement in unimodal settings compared to supervised contrastive learning. Additionally, the learned embedding representation enables the prediction of metadata, enhancing interpretability and making it possible to assess the added metadata's contribution. Our framework lays the foundation for integrating multisensor data with metadata, enabling diverse downstream tasks within the LMD domain and beyond.
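The training objective described in the abstract can be pictured with a small, hedged sketch. The PyTorch snippet below is illustrative only and is not the authors' released code: the encoder choices, feature dimensions, process-parameter inputs, and label definition are all assumptions. It projects image features and process metadata into a shared embedding space and applies a cross-modal supervised contrastive loss; at inference time only the image branch (the "primary modality") is needed.

```python
# Minimal sketch of a joint-embedding + supervised contrastive setup (assumptions:
# PyTorch, pooled ViT-style image features, a handful of process parameters,
# and discretized process-condition classes as the supervision signal).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    """Projects each modality into a shared, L2-normalized embedding space."""
    def __init__(self, image_dim=768, meta_dim=8, embed_dim=128):
        super().__init__()
        # Stand-ins for the real backbones (e.g., a ViT for melt-pool images).
        self.image_proj = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU(),
                                        nn.Linear(256, embed_dim))
        self.meta_proj = nn.Sequential(nn.Linear(meta_dim, 64), nn.ReLU(),
                                       nn.Linear(64, embed_dim))

    def forward(self, image_feats, meta_feats):
        z_img = F.normalize(self.image_proj(image_feats), dim=-1)
        z_meta = F.normalize(self.meta_proj(meta_feats), dim=-1)
        return z_img, z_meta

def supervised_contrastive_loss(z_a, z_b, labels, temperature=0.1):
    """Supervised contrastive loss over both modalities: embeddings sharing a
    label are pulled together, across and within modalities."""
    z = torch.cat([z_a, z_b], dim=0)                   # (2N, D), already normalized
    y = torch.cat([labels, labels], dim=0)             # (2N,)
    sim = z @ z.t() / temperature                      # cosine similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))    # drop self-similarity
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()            # anchors with >=1 positive

# Toy usage: 16 samples with hypothetical features and labels.
model = JointEmbedder()
img = torch.randn(16, 768)           # e.g., pooled ViT features of melt-pool images
meta = torch.randn(16, 8)            # e.g., laser power, scan speed, powder feed rate
labels = torch.randint(0, 4, (16,))  # e.g., discretized process-condition classes
z_img, z_meta = model(img, meta)
loss = supervised_contrastive_loss(z_img, z_meta, labels)
loss.backward()
# For monitoring, only the image branch is used; the metadata branch is dropped.
```

Under this reading, the frozen image embedding could then feed a lightweight head for downstream tasks such as melt pool geometry prediction, or a probe that predicts the metadata back from the embedding, which is one way to interpret the paper's claim about assessing the added metadata's contribution.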
Related papers
- Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model [63.14883657299359]
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering.
However, tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert, where distribution shifts between pre-training and target datasets constrain target performance, and OpenWorld Stabilization, where catastrophic forgetting erases the model's general knowledge.
arXiv Detail & Related papers (2025-03-06T15:29:13Z) - IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Agents [17.301758094000125]
Large language model (LLM) agents have emerged as a promising solution to automate the development of computer vision models.
We introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design.
Iterative Refinement improves stability, interpretability, and overall model performance.
arXiv Detail & Related papers (2025-02-25T01:52:37Z) - A Multimodal Dataset for Enhancing Industrial Task Monitoring and Engagement Prediction [5.73110247142357]
We present a novel dataset that captures realistic assembly and disassembly tasks.
The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video.
Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments.
arXiv Detail & Related papers (2025-01-10T12:57:33Z) - Multimodal LLM for Intelligent Transportation Systems [0.0]
This paper introduces a novel 3-dimensional framework that encapsulates the intersection of applications, machine learning methodologies, and hardware devices.
Instead of using multiple machine learning algorithms, our framework uses a single, data-centric LLM architecture that can analyze time series, images, and videos.
We apply this LLM framework to different sensor datasets, including time-series data and visual data from sources like Oxford Radar RobotCar, D-Behavior (D-Set), nuScenes by Motional, and Comma2k19.
arXiv Detail & Related papers (2024-12-16T11:50:30Z) - Unsupervised Multimodal Fusion of In-process Sensor Data for Advanced Manufacturing Process Monitoring [0.0]
This paper presents a novel approach to multimodal sensor data fusion in manufacturing processes.
We leverage contrastive learning techniques to correlate different data modalities without the need for labeled data.
Our approach facilitates downstream tasks such as process control, anomaly detection, and quality assurance.
arXiv Detail & Related papers (2024-10-29T21:52:04Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance.
In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation [15.35494431928751]
Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving.
We introduce model-attention disaggregation to enhance the efficiency of LLM decoding.
We develop and deploy Lamina, an LLM inference system that incorporates model-attention disaggregation in a distributed heterogeneous cluster.
arXiv Detail & Related papers (2024-05-03T02:15:15Z) - Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.
Lumen first promotes fine-grained vision-language concept alignment.
Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Machine Learning based Indicators to Enhance Process Monitoring by Pattern Recognition [0.4893345190925177]
We propose a novel framework for machine learning based indicators combining pattern type and intensity.
In a case study from the semiconductor industry, our framework goes beyond conventional process control and achieves high-quality experimental results.
arXiv Detail & Related papers (2021-03-24T10:13:20Z)