Towards Faster and More Compact Foundation Models for Molecular Property Prediction
- URL: http://arxiv.org/abs/2504.19538v1
- Date: Mon, 28 Apr 2025 07:41:03 GMT
- Title: Towards Faster and More Compact Foundation Models for Molecular Property Prediction
- Authors: Yasir Ghunaim, Andrés Villa, Gergo Ignacz, Gyorgy Szekely, Motasem Alfarra, Bernard Ghanem
- Abstract summary: The Joint Multi-domain Pre-training (JMP) foundation model has demonstrated strong performance across various downstream tasks. Despite JMP's advantages, fine-tuning it on molecular datasets ranging from small-scale to large-scale requires considerable time and computational resources. Our study provides insights for developing lighter, faster, and more scalable foundation models for molecular and materials discovery.
- Score: 44.64301507940171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in machine learning for molecular property prediction have improved accuracy but at the expense of higher computational cost and longer training times. Recently, the Joint Multi-domain Pre-training (JMP) foundation model has demonstrated strong performance across various downstream tasks with reduced training time over previous models. Despite JMP's advantages, fine-tuning it on molecular datasets ranging from small-scale to large-scale requires considerable time and computational resources. In this work, we investigate strategies to enhance efficiency by reducing model size while preserving performance. To better understand the model's efficiency, we analyze the layer contributions of JMP and find that later interaction blocks provide diminishing returns, suggesting an opportunity for model compression. We explore block reduction strategies by pruning the pre-trained model and evaluating its impact on efficiency and accuracy during fine-tuning. Our analysis reveals that removing two interaction blocks results in a minimal performance drop, reducing the model size by 32% while increasing inference throughput by 1.3x. These results suggest that JMP-L is over-parameterized and that a smaller, more efficient variant can achieve comparable performance with lower computational cost. Our study provides insights for developing lighter, faster, and more scalable foundation models for molecular and materials discovery. The code is publicly available at: https://github.com/Yasir-Ghunaim/efficient-jmp.
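The block-reduction strategy described in the abstract amounts to truncating the stack of interaction blocks in the pre-trained backbone before fine-tuning. Below is a minimal sketch of the idea for a generic PyTorch model; the attribute name `interaction_blocks` is a hypothetical stand-in for the actual JMP/GemNet-OC module layout (see the authors' repository for the real implementation).

```python
import torch.nn as nn

def prune_interaction_blocks(model: nn.Module, num_to_remove: int = 2) -> nn.Module:
    """Drop the last `num_to_remove` interaction blocks from a pre-trained model.

    Assumes the backbone keeps its message-passing blocks in an
    nn.ModuleList attribute named `interaction_blocks` (hypothetical name).
    """
    blocks = model.interaction_blocks
    assert num_to_remove < len(blocks), "cannot remove every block"
    # Keep the earlier blocks; the paper finds the later ones give
    # diminishing returns.
    model.interaction_blocks = nn.ModuleList(blocks[: len(blocks) - num_to_remove])
    return model

# Usage: load the pre-trained checkpoint, prune, then fine-tune as usual.
# model = load_pretrained_jmp()   # hypothetical loader
# model = prune_interaction_blocks(model, num_to_remove=2)
```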
Related papers
- Numerical Pruning for Efficient Autoregressive Models [87.56342118369123]
This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments.
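For intuition, here is a generic second-order (Taylor-expansion) importance score of the kind used in training-free structural pruning; this is an OBD/OBS-style illustration, not the paper's Newton-derived score.

```python
import torch

def second_order_importance(weight: torch.Tensor, grad: torch.Tensor,
                            hess_diag: torch.Tensor) -> torch.Tensor:
    """Per-weight importance from a 2nd-order Taylor expansion of the loss.

    Zeroing w_i changes the loss by roughly -g_i * w_i + 0.5 * h_i * w_i**2
    under a diagonal-Hessian approximation; larger magnitude = more important.
    """
    return (-grad * weight + 0.5 * hess_diag * weight.pow(2)).abs()

# Structural variant: sum scores over a row, head, or channel before ranking.
```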
arXiv Detail & Related papers (2024-12-17T01:09:23Z)
- TRAWL: Tensor Reduced and Approximated Weights for Large Language Models [11.064868044313855]
We introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a technique that applies tensor decomposition across multiple weight matrices to effectively denoise LLMs by capturing global structural patterns. Our experiments show that TRAWL improves model performance by up to 16% over baseline models on benchmark datasets, without requiring additional data, training, or fine-tuning.
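As a point of reference, the simplest matrix analogue of weight-decomposition compression is a truncated SVD of a single weight matrix, sketched below; TRAWL's tensor decomposition operates jointly across multiple matrices, so this is an analogy rather than the method itself.

```python
import torch

def low_rank_approx(weight: torch.Tensor, rank: int) -> torch.Tensor:
    """Replace a weight matrix by its best rank-`rank` approximation (SVD)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    # Keep the top singular directions; the discarded tail is treated as noise.
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]

W = torch.randn(768, 3072)               # e.g., a feed-forward weight matrix
W_denoised = low_rank_approx(W, rank=64)
```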
arXiv Detail & Related papers (2024-06-25T04:01:32Z)
- Are Protein Language Models Compute Optimal? [0.0]
We investigate the optimal ratio between model parameters and training tokens within a fixed compute budget.
Our study reveals that pLM sizes scale sublinearly with compute budget, showing diminishing returns in performance as model size increases.
This work paves the way towards more compute-efficient pLMs, democratizing their training and practical application in computational biology.
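To make "sublinear scaling" concrete, the toy allocation below uses the common C ≈ 6·N·D FLOPs approximation from the LLM scaling-law literature; the exponent and coefficient are illustrative placeholders, not the paper's fitted values.

```python
def optimal_allocation(compute_flops: float, a: float = 0.4, coeff: float = 1.0):
    """Split a FLOPs budget C between parameters N and training tokens D.

    Uses C ~ 6*N*D; `a` < 0.5 encodes sublinear growth of the optimal
    model size with compute. Both `a` and `coeff` are placeholders.
    """
    n_params = coeff * compute_flops ** a   # sublinear in compute
    n_tokens = compute_flops / (6.0 * n_params)
    return n_params, n_tokens

N, D = optimal_allocation(1e21)
print(f"params ~ {N:.3g}, tokens ~ {D:.3g}")
```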
arXiv Detail & Related papers (2024-06-11T13:32:11Z)
- Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations [0.0]
This thesis explores methods of model compression.
We empirically demonstrate that simply skipping the later attention sublayers in Transformer LLMs is an effective form of model compression.
We observed a 21% speed increase in one-token generation for Llama 2 7B, while unexpectedly improving performance on several common benchmarks.
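A minimal sketch of the sublayer-skipping idea: make the attention sublayer a residual pass-through in the last few blocks. The block below is a generic pre-norm transformer stand-in, not the thesis's Llama 2 code.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Generic pre-norm transformer block with an optional attention skip."""

    def __init__(self, d_model: int, n_heads: int, skip_attention: bool = False):
        super().__init__()
        self.skip_attention = skip_attention
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.skip_attention:  # skipped sublayer = identity on the residual
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Skip attention in the last 4 of 12 blocks (illustrative depth and ratio).
blocks = nn.ModuleList([Block(512, 8, skip_attention=(i >= 8)) for i in range(12)])
```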
arXiv Detail & Related papers (2024-04-02T19:53:54Z)
- Reusing Pretrained Models by Multi-linear Operators for Efficient Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize larger ones.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
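A toy, fully dense rendering of that idea is sketched below: a learned linear map from the flattened small-model parameters to the large-model parameters. The paper's multi-linear operators are structured and far more parameter-efficient; a dense map is shown only to make "each target weight depends on all source weights" concrete.

```python
import torch
import torch.nn as nn

class LinearGrowth(nn.Module):
    """Map a flattened small-model parameter vector to a large-model one."""

    def __init__(self, src_numel: int, tgt_numel: int):
        super().__init__()
        # One learned coefficient per (source weight, target weight) pair;
        # feasible only for tiny models, shown purely for intuition.
        self.map = nn.Linear(src_numel, tgt_numel, bias=False)

    def forward(self, src_params: torch.Tensor) -> torch.Tensor:
        return self.map(src_params)

src = torch.randn(1_000)                 # flattened weights of a small model
tgt = LinearGrowth(1_000, 4_000)(src)    # initialization for the larger model
```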
arXiv Detail & Related papers (2023-10-16T06:16:47Z)
- Uncovering the Hidden Cost of Model Compression [43.62624133952414]
Visual Prompting has emerged as a pivotal method for transfer learning in computer vision.
Model compression detrimentally impacts the performance of visual prompting-based transfer.
However, these negative effects on calibration are absent when models are compressed via quantization.
arXiv Detail & Related papers (2023-08-29T01:47:49Z)
- Language model compression with weighted low-rank factorization [73.61874728240568]
We introduce Fisher information to weigh the importance of parameters affecting the model prediction.
We find that our resulting task accuracy is much closer to the original model's performance.
Our method can directly compress a task-specific model while achieving better performance than other compact model strategies.
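A sketch of Fisher-weighted factorization in a simple row-scaled form: rows with more Fisher mass are reconstructed more faithfully by the truncated SVD. Reducing the per-parameter Fisher estimate to a per-row weight is a common approximation; the code is an illustration, not the paper's exact algorithm.

```python
import torch

def fisher_weighted_svd(W: torch.Tensor, fisher: torch.Tensor, rank: int):
    """Factor W ~ A @ B while weighting rows by their Fisher importance.

    `fisher` holds per-parameter importance estimates with the same shape
    as W (e.g., accumulated squared gradients over a calibration set).
    """
    row_weight = fisher.sum(dim=1).clamp_min(1e-8).sqrt()        # (out_dim,)
    U, S, Vh = torch.linalg.svd(row_weight[:, None] * W, full_matrices=False)
    A = (U[:, :rank] * S[:rank]) / row_weight[:, None]           # undo scaling
    B = Vh[:rank, :]
    return A, B

W = torch.randn(256, 512)
F = torch.rand_like(W)                      # stand-in for accumulated grad**2
A, B = fisher_weighted_svd(W, F, rank=32)   # W is approximated by A @ B
```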
arXiv Detail & Related papers (2022-06-30T21:57:07Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but at substantial computational cost.
We explore accelerating large-model inference through conditional computation, exploiting the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
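A toy rendering of the resulting conditional computation: the FFN's hidden neurons are partitioned into expert groups and only the top-scoring groups run for each token. The router and the even partitioning below are simplified placeholders, not MoEfication's actual clustering construction.

```python
import torch
import torch.nn as nn

class MoEfiedFFN(nn.Module):
    """FFN split into expert groups; only the top-k experts run per token."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        d_exp = d_hidden // n_experts
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_exp), nn.ReLU(),
                           nn.Linear(d_exp, d_model)) for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts)  # predicts expert usefulness
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        top = self.router(x).topk(self.top_k, dim=-1).indices
        out = torch.zeros_like(x)
        for t in range(x.size(0)):      # naive per-token loop, for clarity only
            for e in top[t]:
                out[t] += self.experts[int(e)](x[t])
        return out

y = MoEfiedFFN()(torch.randn(4, 512))   # 4 tokens, 2 of 8 experts run for each
```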
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
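The mask mechanics of one grow-and-prune cycle can be sketched with simple magnitude criteria, as below; the paper's schedules, layer-wise partitioning, and growth rule are more elaborate.

```python
import torch

def prune_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the largest-magnitude (1 - sparsity) fraction."""
    n_keep = int(weight.numel() * (1.0 - sparsity))
    # Threshold = smallest magnitude among the kept weights.
    threshold = weight.abs().flatten().kthvalue(weight.numel() - n_keep + 1).values
    return (weight.abs() >= threshold).float()

def grow_and_prune_step(weight: torch.Tensor, sparsity: float = 0.8):
    # Grow: all weights are re-enabled and trained densely for a while
    # (the training itself is omitted in this sketch)...
    dense = weight
    # ...then prune back to the target sparsity by magnitude.
    mask = prune_mask(dense, sparsity)
    return dense * mask, mask

w = torch.randn(128, 128)
w_sparse, m = grow_and_prune_step(w)    # ~80% of entries zeroed
```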
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
- SuperMeshing: A New Deep Learning Architecture for Increasing the Mesh Density of Metal Forming Stress Field with Attention Mechanism and Perceptual Features [1.0660502023086995]
We propose a new data-driven mesh-density boost model named SuperMeshingNet.
It builds on the advantages of finite element analysis (FEA) by taking low mesh-density FEA results as inputs to the deep learning model.
Compared to the baseline that applied the linear method, SuperMeshingNet achieves a prominent reduction in the mean squared error (MSE) and mean absolute error (MAE) on test data.
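As a minimal stand-in for the mesh-density-boost idea, the sketch below treats the stress field as a single-channel image on a regular grid and doubles its resolution with a small CNN; SuperMeshingNet's actual architecture, with its attention mechanism and perceptual features, is far richer.

```python
import torch
import torch.nn as nn

class TinySuperMesher(nn.Module):
    """Upsample a coarse stress field to 2x resolution (toy model)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 4, 3, padding=1),  # 4 = 2*2 channels for the shuffle
            nn.PixelShuffle(2),              # rearrange channels to double H, W
        )

    def forward(self, coarse: torch.Tensor) -> torch.Tensor:
        return self.net(coarse)             # (B, 1, H, W) -> (B, 1, 2H, 2W)

field = torch.randn(1, 1, 32, 32)           # coarse-mesh FEA stress field
fine = TinySuperMesher()(field)             # (1, 1, 64, 64)
```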
arXiv Detail & Related papers (2021-03-12T06:02:30Z)