Building Vision Models upon Heat Conduction
- URL: http://arxiv.org/abs/2405.16555v2
- Date: Mon, 14 Apr 2025 10:44:13 GMT
- Title: Building Vision Models upon Heat Conduction
- Authors: Zhaozhi Wang, Yue Liu, Yunjie Tian, Yunfan Liu, Yaowei Wang, Qixiang Ye,
- Abstract summary: This study introduces the Heat Conduction Operator (HCO) built upon the physical heat conduction principle. HCO conceptualizes image patches as heat sources and models their correlations through adaptive thermal energy diffusion. vHeat achieves up to a 3x throughput, 80% less GPU memory allocation, and 35% fewer computational FLOPs compared to the Swin-Transformer.
- Score: 66.1594989193046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual representation models leveraging attention mechanisms are challenged by significant computational overhead, particularly when pursuing large receptive fields. In this study, we aim to mitigate this challenge by introducing the Heat Conduction Operator (HCO) built upon the physical heat conduction principle. HCO conceptualizes image patches as heat sources and models their correlations through adaptive thermal energy diffusion, enabling robust visual representations. HCO enjoys a computational complexity of $O(N^{1.5})$, as it can be implemented using discrete cosine transformation (DCT) operations. HCO is plug-and-play; combining it with deep learning backbones produces visual representation models (termed vHeat) with global receptive fields. Experiments across vision tasks demonstrate that, beyond stronger performance, vHeat achieves up to a 3x throughput, 80% less GPU memory allocation, and 35% fewer computational FLOPs compared to the Swin-Transformer. Code is available at https://github.com/MzeroMiko/vHeat.
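The heat equation has a closed-form solution in the DCT frequency domain, which is what makes the operator described in the abstract cheap: transform, apply a per-frequency exponential decay, transform back. The sketch below is a minimal illustrative version with a fixed scalar diffusivity `k` and diffusion time `t` (both hypothetical parameters; vHeat instead learns frequency-wise coefficients), not the actual vHeat implementation:

```python
import numpy as np
from scipy.fft import dctn, idctn

def heat_conduction_operator(x, k=1.0, t=1.0):
    """Illustrative heat-conduction step on an HxW feature map.

    Solves the 2D heat equation in the frequency domain: forward
    DCT, per-frequency exponential decay, inverse DCT. This
    DCT-based formulation is what the abstract's O(N^1.5)
    complexity refers to. `k` (diffusivity) and `t` (time) are
    illustrative scalars.
    """
    h, w = x.shape
    # DCT-II spatial frequencies along each axis
    wy = np.pi * np.arange(h) / h
    wx = np.pi * np.arange(w) / w
    freq2 = wy[:, None] ** 2 + wx[None, :] ** 2
    spec = dctn(x, norm="ortho")       # to frequency domain
    spec *= np.exp(-k * freq2 * t)     # closed-form heat decay
    return idctn(spec, norm="ortho")   # back to spatial domain
```

Because the zero-frequency (DC) component decays by exp(0) = 1, the operator preserves the mean of the input while smoothing out high-frequency detail, mirroring physical heat diffusion.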
Related papers
- Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions [94.21989689001848]
We propose ΔConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (ΔConvBlocks).
By distilling attention patterns into localized convolutional operations while keeping other components frozen, ΔConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929× and surpassing LinFusion by 5.42× in efficiency, all without compromising generative fidelity.
arXiv Detail & Related papers (2025-04-30T03:57:28Z) - PearSAN: A Machine Learning Method for Inverse Design using Pearson Correlated Surrogate Annealing [66.27103948750306]
PearSAN is a machine learning-assisted optimization algorithm applicable to inverse design problems with large design spaces.
It uses a Pearson correlated surrogate model to predict the figure of merit of the true design metric.
It achieves a state-of-the-art maximum design efficiency of 97%, and is at least an order of magnitude faster than previous methods.
arXiv Detail & Related papers (2024-12-26T17:02:19Z) - Distillation of Diffusion Features for Semantic Correspondence [23.54555663670558]
We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency.
We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost.
Our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence.
arXiv Detail & Related papers (2024-12-04T17:55:33Z) - RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model [59.37279559684668]
We introduce RS-vHeat, an efficient multi-modal remote sensing foundation model.
Specifically, RS-vHeat applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field.
Compared to attention-based remote sensing foundation models, we reduce memory usage by 84%, FLOPs by 24%, and improve throughput by 2.7 times.
arXiv Detail & Related papers (2024-11-27T01:43:38Z) - Enhancing Thermal MOT: A Novel Box Association Method Leveraging Thermal Identity and Motion Similarity [0.6249768559720122]
Multiple Object Tracking (MOT) in thermal imaging presents unique challenges due to the lack of visual features and the complexity of motion patterns.
This paper introduces an innovative approach to improve MOT in the thermal domain by developing a novel box association method.
Our method merges thermal feature sparsity and dynamic object tracking, enabling more accurate and robust MOT performance.
arXiv Detail & Related papers (2024-11-20T00:27:01Z) - Vision Calorimeter: Migrating Visual Object Detector to High-energy Particle Images [32.42087197412159]
Vision Calorimeter (ViC) is a data-driven framework which migrates visual object detection techniques to high-energy particle images.
ViC significantly outperforms traditional approaches, reducing the incident position prediction error by 46.16%.
This study underscores ViC's great potential as a general-purpose particle parameter estimator in high-energy physics.
arXiv Detail & Related papers (2024-08-20T07:14:28Z) - HcNet: Image Modeling with Heat Conduction Equation [6.582336726258388]
This paper aims to integrate the overall architectural design of the model into the heat conduction theory framework.
Our Heat Conduction Network (HcNet) still shows competitive performance.
arXiv Detail & Related papers (2024-08-12T02:48:00Z) - Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z) - Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer does not always lead to enhanced performance.
We present a theoretical framework that sheds light on the memorization during pre-training of transformer-based language models.
arXiv Detail & Related papers (2024-05-14T15:48:36Z) - COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction [60.87168562615171]
The autonomous driving community has shown significant interest in 3D occupancy prediction.
We propose Compact Occupancy TRansformer (COTR) with a geometry-aware occupancy encoder and a semantic-aware group decoder.
COTR outperforms baselines with a relative improvement of 8%-15%.
arXiv Detail & Related papers (2023-12-04T14:23:18Z) - X-HRNet: Towards Lightweight Human Pose Estimation with Spatially Unidimensional Self-Attention [63.64944381130373]
In particular, predominant pose estimation methods estimate human joints by 2D single-peak heatmaps.
We introduce a lightweight and powerful alternative, Spatially Unidimensional Self-Attention (SUSA), to the pointwise (1x1) convolution.
Our SUSA reduces the computational complexity of the pointwise (1x1) convolution by 96% without sacrificing accuracy.
arXiv Detail & Related papers (2023-10-12T05:33:25Z) - Deep convolutional surrogates and degrees of freedom in thermal design [0.0]
Convolutional Neural Networks (CNNs) are used to predict results of Computational Fluid Dynamics (CFD) directly from topologies saved as images.
We present surrogate models for heat transfer and pressure drop prediction of complex fin geometries generated using composite Bezier curves.
arXiv Detail & Related papers (2022-08-16T00:45:39Z) - Image-specific Convolutional Kernel Modulation for Single Image Super-resolution [85.09413241502209]
To address this issue, we propose a novel image-specific convolutional kernel modulation (IKM) method.
We exploit the global contextual information of image or feature to generate an attention weight for adaptively modulating the convolutional kernels.
Experiments on single image super-resolution show that the proposed methods achieve superior performances over state-of-the-art methods.
arXiv Detail & Related papers (2021-11-16T11:05:10Z) - Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although the network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z) - TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning [87.38675639186405]
We propose a novel graph neural network approach, called TCL, which deals with the dynamically-evolving graph in a continuous-time fashion.
To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs.
arXiv Detail & Related papers (2021-05-17T15:33:25Z) - Simultaneous Face Hallucination and Translation for Thermal to Visible Face Verification using Axial-GAN [74.22129648654783]
We introduce the task of thermal-to-visible face verification from low-resolution thermal images.
We propose Axial-Generative Adversarial Network (Axial-GAN) to synthesize high-resolution visible images for matching.
arXiv Detail & Related papers (2021-04-13T22:34:28Z) - Learning Accurate Entropy Model with Global Reference for Image Compression [22.171750277528222]
We propose a novel Global Reference Model for image compression to leverage both the local and the global context information.
A by-product of this work is the innovation of a mean-shifting GDN module that further improves the performance.
arXiv Detail & Related papers (2020-10-16T11:27:46Z) - Efficient and Model-Based Infrared and Visible Image Fusion Via Algorithm Unrolling [24.83209572888164]
Infrared and visible image fusion (IVIF) expects to obtain images that retain thermal radiation information from infrared images and texture details from visible images.
A model-based convolutional neural network (CNN) model is proposed to overcome the shortcomings of traditional CNN-based IVIF models.
arXiv Detail & Related papers (2020-05-12T16:15:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.