Related papers: vHeat: Building Vision Models upon Heat Conduction

URL: http://arxiv.org/abs/2405.16555v1
Date: Sun, 26 May 2024 12:58:04 GMT
Title: vHeat: Building Vision Models upon Heat Conduction
Authors: Zhaozhi Wang, Yue Liu, Yunfan Liu, Hongtian Yu, Yaowei Wang, Qixiang Ye, Yunjie Tian
Abstract summary: vHeat is a novel vision backbone model that simultaneously achieves both high computational efficiency and global receptive field. The essential idea is to conceptualize image patches as heat sources and model the calculation of their correlations as the diffusion of thermal energy.
Score: 63.00030330898876
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A fundamental problem in learning robust and expressive visual representations lies in efficiently estimating the spatial relationships of visual semantics throughout the entire image. In this study, we propose vHeat, a novel vision backbone model that simultaneously achieves both high computational efficiency and global receptive field. The essential idea, inspired by the physical principle of heat conduction, is to conceptualize image patches as heat sources and model the calculation of their correlations as the diffusion of thermal energy. This mechanism is incorporated into deep models through the newly proposed module, the Heat Conduction Operator (HCO), which is physically plausible and can be efficiently implemented using DCT and IDCT operations with a complexity of $\mathcal{O}(N^{1.5})$. Extensive experiments demonstrate that vHeat surpasses Vision Transformers (ViTs) across various vision tasks, while also providing higher inference speeds, reduced FLOPs, and lower GPU memory usage for high-resolution images. The code will be released at https://github.com/MzeroMiko/vHeat.
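The abstract's idea of diffusing heat through DCT and IDCT operations can be illustrated with the classical heat equation on a grid with insulated borders. The following is a minimal sketch under stated assumptions, not the authors' Heat Conduction Operator: the function names (`dct_matrix`, `matmul`, `heat_conduction`) and the parameters `k` (thermal diffusivity) and `t` (diffusion time) are hypothetical, and any learnable components of the published module are omitted. It transforms the grid with a 2D DCT, damps each cosine mode by an exponential factor, and transforms back. Because the transforms here are written as dense matrix products, a $\sqrt{N} \times \sqrt{N}$ grid costs $\mathcal{O}(N^{1.5})$ multiplications, which is consistent with the complexity quoted above; whether vHeat arrives at that figure the same way is an assumption.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n); its transpose performs the IDCT."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def heat_conduction(u0, k=1.0, t=1.0):
    """Diffuse the 2D grid u0 for time t with diffusivity k (hypothetical names)."""
    h, w = len(u0), len(u0[0])
    ch, cw = dct_matrix(h), dct_matrix(w)
    # Forward 2D DCT: C_h @ u0 @ C_w^T
    spec = matmul(matmul(ch, u0), transpose(cw))
    # Cosine mode (p, q) decays as exp(-k * (w_p^2 + w_q^2) * t)
    for p in range(h):
        for q in range(w):
            freq2 = (math.pi * p / h) ** 2 + (math.pi * q / w) ** 2
            spec[p][q] *= math.exp(-k * freq2 * t)
    # Inverse 2D DCT: C_h^T @ spec @ C_w
    return matmul(matmul(transpose(ch), spec), cw)


# A single "heat source" patch on a 4 x 4 grid
grid = [[0.0] * 4 for _ in range(4)]
grid[1][2] = 1.0
diffused = heat_conduction(grid, k=1.0, t=1.0)
```

In this sketch the zero-frequency mode is left undamped, so the total heat on the grid is conserved while the source patch spreads to every other position in a single step, which is one way to read the "global receptive field" mentioned in the abstract.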

Related papers

Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions [94.21989689001848]
We propose $\Delta$ConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks ($\Delta$ConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, $\Delta$ConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929$\times$ and surpassing LinFusion by 5.42$\times$ in efficiency, all without compromising generative fidelity.
arXiv Detail & Related papers (2025-04-30T03:57:28Z)
PearSAN: A Machine Learning Method for Inverse Design using Pearson Correlated Surrogate Annealing [66.27103948750306]
PearSAN is a machine learning-assisted optimization algorithm applicable to inverse design problems with large design spaces. It uses a Pearson correlated surrogate model to predict the figure of merit of the true design metric. It achieves a state-of-the-art maximum design efficiency of 97%, and is at least an order of magnitude faster than previous methods.
arXiv Detail & Related papers (2024-12-26T17:02:19Z)
Distillation of Diffusion Features for Semantic Correspondence [23.54555663670558]
We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency. We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost. Our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence.
arXiv Detail & Related papers (2024-12-04T17:55:33Z)
RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model [59.37279559684668]
We introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field. Compared to attention-based remote sensing foundation models, we reduce memory usage by 84% and FLOPs by 24%, and improve throughput by 2.7 times.
arXiv Detail & Related papers (2024-11-27T01:43:38Z)
Enhancing Thermal MOT: A Novel Box Association Method Leveraging Thermal Identity and Motion Similarity [0.6249768559720122]
Multiple Object Tracking (MOT) in thermal imaging presents unique challenges due to the lack of visual features and the complexity of motion patterns. This paper introduces an innovative approach to improve MOT in the thermal domain by developing a novel box association method. Our method merges thermal feature sparsity and dynamic object tracking, enabling more accurate and robust MOT performance.
arXiv Detail & Related papers (2024-11-20T00:27:01Z)
Vision Calorimeter: Migrating Visual Object Detector to High-energy Particle Images [32.42087197412159]
Vision Calorimeter (ViC) is a data-driven framework which migrates visual object detection techniques to high-energy particle images. ViC significantly outperforms traditional approaches, reducing the incident position prediction error by 46.16%. This study underscores ViC's great potential as a general-purpose particle parameter estimator in high-energy physics.
arXiv Detail & Related papers (2024-08-20T07:14:28Z)
HcNet: Image Modeling with Heat Conduction Equation [6.582336726258388]
This paper aims to integrate the overall architectural design of the model into the heat conduction theory framework. Our Heat Conduction Network (HcNet) still shows competitive performance.
arXiv Detail & Related papers (2024-08-12T02:48:00Z)
Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration. We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer does not always lead to enhanced performance. We present a theoretical framework that sheds light on the memorization during pre-training of transformer-based language models.
arXiv Detail & Related papers (2024-05-14T15:48:36Z)
COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction [60.87168562615171]
The autonomous driving community has shown significant interest in 3D occupancy prediction. We propose Compact Occupancy TRansformer (COTR) with a geometry-aware occupancy encoder and a semantic-aware group decoder. COTR outperforms baselines with a relative improvement of 8%-15%.
arXiv Detail & Related papers (2023-12-04T14:23:18Z)
X-HRNet: Towards Lightweight Human Pose Estimation with Spatially Unidimensional Self-Attention [63.64944381130373]
In particular, predominant pose estimation methods estimate human joints by 2D single-peak heatmaps. We introduce a lightweight and powerful alternative, Spatially Unidimensional Self-Attention (SUSA), to the pointwise (1x1) convolution. Our SUSA reduces the computational complexity of the pointwise (1x1) convolution by 96% without sacrificing accuracy.
arXiv Detail & Related papers (2023-10-12T05:33:25Z)
Deep convolutional surrogates and degrees of freedom in thermal design [0.0]
Convolutional Neural Networks (CNNs) are used to predict results of Computational Fluid Dynamics (CFD) directly from topologies saved as images. We present surrogate models for heat transfer and pressure drop prediction of complex fin geometries generated using composite Bezier curves.
arXiv Detail & Related papers (2022-08-16T00:45:39Z)
Image-specific Convolutional Kernel Modulation for Single Image Super-resolution [85.09413241502209]
In this issue, we propose a novel image-specific convolutional modulation kernel (IKM). We exploit the global contextual information of image or feature to generate an attention weight for adaptively modulating the convolutional kernels. Experiments on single image super-resolution show that the proposed methods achieve superior performances over state-of-the-art methods.
arXiv Detail & Related papers (2021-11-16T11:05:10Z)
Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks. Although the network performance is boosted, transformers often require more computational resources. We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning [87.38675639186405]
We propose a novel graph neural network approach, called TCL, which deals with the dynamically-evolving graph in a continuous-time fashion. To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs.
arXiv Detail & Related papers (2021-05-17T15:33:25Z)
Simultaneous Face Hallucination and Translation for Thermal to Visible Face Verification using Axial-GAN [74.22129648654783]
We introduce the task of thermal-to-visible face verification from low-resolution thermal images. We propose Axial-Generative Adversarial Network (Axial-GAN) to synthesize high-resolution visible images for matching.
arXiv Detail & Related papers (2021-04-13T22:34:28Z)
Learning Accurate Entropy Model with Global Reference for Image Compression [22.171750277528222]
We propose a novel Global Reference Model for image compression to leverage both the local and the global context information. A by-product of this work is the innovation of a mean-shifting GDN module that further improves the performance.
arXiv Detail & Related papers (2020-10-16T11:27:46Z)
Efficient and Model-Based Infrared and Visible Image Fusion Via Algorithm Unrolling [24.83209572888164]
Infrared and visible image fusion (IVIF) expects to obtain images that retain thermal radiation information from infrared images and texture details from visible images. A model-based convolutional neural network (CNN) model is proposed to overcome the shortcomings of traditional CNN-based IVIF models.
arXiv Detail & Related papers (2020-05-12T16:15:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.