NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
- URL: http://arxiv.org/abs/2602.22059v1
- Date: Wed, 25 Feb 2026 16:08:46 GMT
- Title: NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
- Authors: Dengdi Sun, Xiaoya Zhou, Xiao Wang, Hao Si, Wanli Lyu, Jin Tang, Bin Luo
- Abstract summary: We propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability.
- Score: 17.27120526151699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability. We conduct large-scale pre-training on twelve PDE datasets from diverse sources and successfully transfer the model to downstream tasks. Extensive experiments demonstrate the effectiveness of our approach.
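The abstract describes the nested routing only in prose. Below is a minimal PyTorch sketch of what such a scheme could look like: an image-level router selects expert branches from a pooled global feature, and each branch contains a token-level sub-MoE that routes individual tokens to sub-experts. All class names, layer sizes, and the top-k rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a nested Mixture-of-Experts layer (assumed structure, not the paper's code):
# an image-level router picks expert branches from a pooled global feature, and each branch
# contains a token-level sub-MoE that routes individual tokens to sub-expert MLPs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenSubMoE(nn.Module):
    """Token-level sub-MoE: each token is softly mixed over a few sub-expert MLPs."""

    def __init__(self, dim: int, num_sub_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_sub_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_sub_experts)
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        weights = F.softmax(self.router(x), dim=-1)                      # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, T, D, E)
        return torch.einsum("btde,bte->btd", expert_out, weights)


class NestedMoELayer(nn.Module):
    """Image-level MoE over branches, each branch holding a token-level sub-MoE."""

    def __init__(self, dim: int, num_branches: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.image_router = nn.Linear(dim, num_branches)
        self.branches = nn.ModuleList(TokenSubMoE(dim) for _ in range(num_branches))

    def forward(self, x):  # x: (batch, tokens, dim)
        global_feat = x.mean(dim=1)                              # pooled "image-level" feature
        logits = self.image_router(global_feat)                  # (B, num_branches)
        topk_val, topk_idx = logits.topk(self.top_k, dim=-1)     # pick top-k branches per sample
        gate = F.softmax(topk_val, dim=-1)                       # (B, top_k) mixing weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for b, branch in enumerate(self.branches):
                mask = topk_idx[:, slot] == b                    # samples routed to branch b
                if mask.any():
                    w = gate[mask][:, slot, None, None]          # (n, 1, 1)
                    out[mask] = out[mask] + w * branch(x[mask])
        return out


if __name__ == "__main__":
    layer = NestedMoELayer(dim=64)
    print(layer(torch.randn(2, 256, 64)).shape)  # -> torch.Size([2, 256, 64])
```

In this sketch only the top-k branches are evaluated per sample, which is the usual way sparse MoE routing keeps compute roughly constant as the expert count grows; the actual dispatch and load-balancing details of NESTOR are not specified in the abstract.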
Related papers
- Deep Hierarchical Learning with Nested Subspace Networks [53.71337604556311]
We propose Nested Subspace Networks (NSNs) for large neural networks. NSNs enable a single model to be dynamically and granularly adjusted across a continuous spectrum of compute budgets. We show that NSNs can be surgically applied to pre-trained LLMs and unlock a smooth and predictable compute-performance frontier.
arXiv Detail & Related papers (2025-09-22T15:13:14Z) - Latent Mamba Operator for Partial Differential Equations [8.410938527671341]
We introduce the Latent Mamba Operator (LaMO), which integrates the efficiency of state-space models (SSMs) in latent space with the expressive power of kernel integral formulations in neural operators. LaMOs achieve consistent state-of-the-art (SOTA) performance, with a 32.3% improvement over existing baselines in solution operator approximation.
arXiv Detail & Related papers (2025-05-25T11:51:31Z) - Paving the way for scientific foundation models: enhancing generalization and robustness in PDEs with constraint-aware pre-training [49.8035317670223]
A scientific foundation model (SciFM) is emerging as a promising tool for learning transferable representations across diverse domains. We propose incorporating PDE residuals into pre-training, either as the sole learning signal or in combination with data loss, to compensate for limited or infeasible training data. Our results show that pre-training with PDE constraints significantly enhances generalization, outperforming models trained solely on solution data (a sketch of such a combined loss appears after this list).
arXiv Detail & Related papers (2025-03-24T19:12:39Z) - DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [86.76714527437383]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that, under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches (a sketch of this routing mechanism appears after this list).
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - Mamba Neural Operator: Who Wins? Transformers vs. State-Space Models for PDEs [14.14673083512826]
Partial differential equations (PDEs) are widely used to model complex physical systems. Transformers have emerged as the preferred architecture for PDEs due to their ability to capture intricate dependencies. We introduce the Mamba Neural Operator (MNO), a novel framework that enhances neural operator-based techniques for solving PDEs.
arXiv Detail & Related papers (2024-10-03T00:32:31Z) - DeltaPhi: Physical States Residual Learning for Neural Operators in Data-Limited PDE Solving [54.605760146540234]
DeltaPhi is a novel learning framework that transforms the PDE solving task from learning direct input-output mappings to learning the residuals between similar physical states. Extensive experiments demonstrate consistent and significant improvements across diverse physical systems.
arXiv Detail & Related papers (2024-06-14T07:45:07Z) - Inducing Point Operator Transformer: A Flexible and Scalable Architecture for Solving PDEs [7.152311859951986]
We introduce an attention-based model called the inducing-point operator transformer (IPOT).
IPOT is designed to handle any input function and output query while capturing global interactions in a computationally efficient way.
By detaching the inputs/outputs discretizations from the processor with a smaller latent bottleneck, IPOT offers flexibility in processing arbitrary discretizations.
arXiv Detail & Related papers (2023-12-18T06:57:31Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both inference accuracy and mean squared error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - Mitigating spectral bias for the multiscale operator learning [14.404769413313371]
We propose a hierarchical attention neural operator (HANO) inspired by the hierarchical matrix approach.
HANO features a scale-adaptive interaction range and self-attentions over a hierarchy of levels, enabling nested feature computation with controllable linear cost.
Our numerical experiments demonstrate that HANO outperforms state-of-the-art (SOTA) methods for representative multiscale problems.
arXiv Detail & Related papers (2022-10-19T21:09:29Z) - Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems.
Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for mean-field control (MFC).
We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z)
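For the constraint-aware pre-training entry above, the combined objective can be illustrated with a short sketch: a data loss on predicted solutions plus a PDE-residual penalty. The example below assumes a 1D heat equation u_t = alpha * u_xx discretised with finite differences; the equation, discretisation, and weighting are assumptions for illustration and are not taken from the cited paper's code.

```python
# Minimal sketch of a constraint-aware pre-training loss: solution-data loss plus a
# PDE-residual term for an assumed 1D heat equation u_t = alpha * u_xx (illustrative only).
import torch
import torch.nn.functional as F


def heat_residual(u, dt: float, dx: float, alpha: float = 0.1):
    """Finite-difference residual of u_t - alpha * u_xx on a (batch, time, space) grid."""
    u_t = (u[:, 1:, 1:-1] - u[:, :-1, 1:-1]) / dt                           # forward diff in time
    u_xx = (u[:, :-1, 2:] - 2 * u[:, :-1, 1:-1] + u[:, :-1, :-2]) / dx**2   # central diff in space
    return u_t - alpha * u_xx


def pretraining_loss(pred, target, dt, dx, lam: float = 0.5):
    """Weighted sum of data loss and PDE-constraint loss (lam is an assumed weight)."""
    data_loss = F.mse_loss(pred, target)
    residual_loss = heat_residual(pred, dt, dx).pow(2).mean()
    return data_loss + lam * residual_loss


if __name__ == "__main__":
    pred = torch.randn(4, 16, 64, requires_grad=True)   # (batch, time steps, grid points)
    target = torch.randn(4, 16, 64)
    loss = pretraining_loss(pred, target, dt=0.01, dx=1.0 / 64)
    loss.backward()
    print(float(loss))
```

Setting lam to 1 and dropping the data term recovers the "residual as the sole learning signal" variant mentioned in that entry.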
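The DSMoE entry above describes routing over partitioned FFN blocks with sigmoid gates and straight-through estimators. The following is a minimal sketch of that gating pattern; the block count, threshold, and module names are assumptions, not the cited implementation.

```python
# Minimal sketch of sigmoid gating with a straight-through estimator (STE):
# the forward pass applies a hard 0/1 mask per FFN block, while gradients flow
# through the soft sigmoid. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


class STESigmoidGate(nn.Module):
    def __init__(self, dim: int, num_blocks: int, threshold: float = 0.5):
        super().__init__()
        self.proj = nn.Linear(dim, num_blocks)
        self.threshold = threshold

    def forward(self, x):  # x: (..., dim)
        soft = torch.sigmoid(self.proj(x))            # (..., num_blocks)
        hard = (soft > self.threshold).float()        # non-differentiable 0/1 mask
        # Straight-through: forward uses `hard`, backward sees the gradient of `soft`.
        return hard + soft - soft.detach()


class BlockPartitionedFFN(nn.Module):
    """FFN whose hidden units are partitioned into blocks, each gated per token."""

    def __init__(self, dim: int, hidden: int = 256, num_blocks: int = 4):
        super().__init__()
        assert hidden % num_blocks == 0
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)
        self.gate = STESigmoidGate(dim, num_blocks)
        self.num_blocks = num_blocks

    def forward(self, x):  # x: (batch, tokens, dim)
        h = torch.relu(self.up(x))                                        # (B, T, hidden)
        g = self.gate(x)                                                  # (B, T, num_blocks)
        h = h.view(*h.shape[:-1], self.num_blocks, -1) * g.unsqueeze(-1)  # mask hidden blocks
        return self.down(h.flatten(-2))


if __name__ == "__main__":
    ffn = BlockPartitionedFFN(dim=64)
    print(ffn(torch.randn(2, 10, 64)).shape)  # -> torch.Size([2, 10, 64])
```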