CViT: Continuous Vision Transformer for Operator Learning
- URL: http://arxiv.org/abs/2405.13998v2
- Date: Thu, 03 Oct 2024 18:16:07 GMT
- Title: CViT: Continuous Vision Transformer for Operator Learning
- Authors: Sifan Wang, Jacob H Seidman, Shyam Sankaran, Hanwen Wang, George J. Pappas, Paris Perdikaris,
- Abstract summary: Continuous Vision Transformer (CViT) is a novel neural operator architecture that leverages advances in computer vision to address challenges in learning complex physical systems.
CViT combines a vision transformer encoder, a novel grid-based coordinate embedding, and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies.
We demonstrate CViT's effectiveness across a diverse range of partial differential equation (PDE) systems, including fluid dynamics, climate modeling, and reaction-diffusion processes.
- Score: 24.1795082775376
- License:
- Abstract: Operator learning, which aims to approximate maps between infinite-dimensional function spaces, is an important area in scientific machine learning with applications across various physical domains. Here we introduce the Continuous Vision Transformer (CViT), a novel neural operator architecture that leverages advances in computer vision to address challenges in learning complex physical systems. CViT combines a vision transformer encoder, a novel grid-based coordinate embedding, and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. This design allows for flexible output representations and consistent evaluation at arbitrary resolutions. We demonstrate CViT's effectiveness across a diverse range of partial differential equation (PDE) systems, including fluid dynamics, climate modeling, and reaction-diffusion processes. Our comprehensive experiments show that CViT achieves state-of-the-art performance on multiple benchmarks, often surpassing larger foundation models, even without extensive pretraining and roll-out fine-tuning. Taken together, CViT exhibits robust handling of discontinuous solutions, multi-scale features, and intricate spatio-temporal dynamics. Our contributions can be viewed as a significant step towards adapting advanced computer vision architectures for building more flexible and accurate machine learning models in the physical sciences.
Related papers
- LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.
We introduce key innovations to optimize generative performance for vision tasks.
The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z) - Bond Graphs for multi-physics informed Neural Networks for multi-variate time series [6.775534755081169]
Existing methods are not adapted to tasks with complex multi-physical and multi-domain phenomena.
We propose a Neural Bond graph (NBgE) producing multi-physics-informed representations that can be fed into any task-specific model.
arXiv Detail & Related papers (2024-05-22T12:30:25Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - OnDev-LCT: On-Device Lightweight Convolutional Transformers towards
federated learning [29.798780069556074]
Federated learning (FL) has emerged as a promising approach to collaboratively train machine learning models across multiple edge devices.
We propose OnDev-LCT: Lightweight Convolutional Transformers for On-Device vision tasks with limited training data and resources.
arXiv Detail & Related papers (2024-01-22T02:17:36Z) - ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR)
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC)
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Learned multiphysics inversion with differentiable programming and
machine learning [1.8893605328938345]
We present the Seismic Laboratory for Imaging and Modeling/Monitoring (SLIM) open-source software framework for computational geophysics.
By integrating multiple layers of abstraction, our software is designed to be both readable and scalable.
arXiv Detail & Related papers (2023-04-12T03:38:22Z) - Video Coding for Machine: Compact Visual Representation Compression for
Intelligent Collaborative Analytics [101.35754364753409]
Video Coding for Machines (VCM) is committed to bridging to an extent separate research tracks of video/image compression and feature compression.
This paper summarizes VCM methodology and philosophy based on existing academia and industrial efforts.
arXiv Detail & Related papers (2021-10-18T12:42:13Z) - ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, ie, ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z) - AP-MTL: Attention Pruned Multi-task Learning Model for Real-time
Instrument Detection and Segmentation in Robot-assisted Surgery [23.33984309289549]
Training a real-time robotic system for the detection and segmentation of high-resolution images provides a challenging problem with the limited computational resource.
We develop a novel end-to-end trainable real-time Multi-Task Learning model with weight-shared encoder and task-aware detection and segmentation decoders.
Our model significantly outperforms state-of-the-art segmentation and detection models, including best-performed models in the challenge.
arXiv Detail & Related papers (2020-03-10T14:24:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.