S2CFormer: Revisiting the RD-Latency Trade-off in Transformer-based Learned Image Compression
- URL: http://arxiv.org/abs/2502.00700v3
- Date: Mon, 24 Mar 2025 09:19:16 GMT
- Title: S2CFormer: Revisiting the RD-Latency Trade-off in Transformer-based Learned Image Compression
- Authors: Yunuo Chen, Qian Li, Bing He, Donghui Feng, Ronghua Wu, Qi Wang, Li Song, Guo Lu, Wenjun Zhang
- Abstract summary: Transformer-based Learned Image Compression (LIC) suffers from a suboptimal trade-off between decoding latency and rate-distortion (R-D) performance. Our research reveals that efficient channel aggregation, rather than complex and time-consuming spatial operations, is the key to achieving competitive LIC models.
- Score: 26.920782099405915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based Learned Image Compression (LIC) suffers from a suboptimal trade-off between decoding latency and rate-distortion (R-D) performance. Moreover, the critical role of the FeedForward Network (FFN)-based channel aggregation module has been largely overlooked. Our research reveals that efficient channel aggregation, rather than complex and time-consuming spatial operations, is the key to achieving competitive LIC models. Based on this insight, we initiate the "S2CFormer" paradigm, a general architecture that simplifies spatial operations and enhances channel operations to overcome the previous trade-off. We present two instances of the S2CFormer: S2C-Conv and S2C-Attention. Both models demonstrate state-of-the-art (SOTA) R-D performance and significantly faster decoding speed. Furthermore, we introduce S2C-Hybrid, an enhanced variant that combines the strengths of different S2CFormer instances to achieve a better performance-latency trade-off. This model outperforms all existing methods on the Kodak, Tecnick, and CLIC Professional Validation datasets, setting a new benchmark for efficient and high-performance LIC. The code is at https://github.com/YunuoChen/S2CFormer.
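To make the core idea concrete, below is a minimal PyTorch sketch of an S2C-style block, assuming a depth-wise convolution as the simplified spatial operator (in the spirit of S2C-Conv) and an expanded point-wise FFN as the enhanced channel aggregation. The class name S2CConvBlock and the hidden_ratio parameter are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Illustrative sketch only, not the authors' released implementation.
import torch
import torch.nn as nn


class S2CConvBlock(nn.Module):
    """S2C-style block: cheap spatial mixing, rich channel aggregation."""

    def __init__(self, dim: int, hidden_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Simplified spatial operation: a single depth-wise 3x3 convolution.
        self.spatial = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        # Enhanced channel aggregation: FFN-style point-wise expansion and projection.
        self.channel = nn.Sequential(
            nn.Conv2d(dim, dim * hidden_ratio, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * hidden_ratio, dim, kernel_size=1),
        )

    def _norm(self, norm: nn.LayerNorm, x: torch.Tensor) -> torch.Tensor:
        # Apply LayerNorm over the channel dimension of an NCHW tensor.
        return norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.spatial(self._norm(self.norm1, x))
        x = x + self.channel(self._norm(self.norm2, x))
        return x


if __name__ == "__main__":
    block = S2CConvBlock(dim=192)
    y = block(torch.randn(1, 192, 16, 16))
    print(y.shape)  # torch.Size([1, 192, 16, 16])
```

In this sketch most of the computation sits in the 1x1 convolutions of the channel branch, which matches the abstract's claim that channel aggregation, rather than elaborate spatial mixing, is what drives competitive R-D performance.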
Related papers
- Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis [50.77548592888096]
Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals.
Turbo2K is an efficient framework for generating detail-rich 2K videos.
arXiv Detail & Related papers (2025-04-20T03:30:59Z) - Ultra-Resolution Adaptation with Ease [62.56434979517156]
We propose a set of key guidelines for ultra-resolution adaptation, termed URAE.
We show that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable.
Experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations.
arXiv Detail & Related papers (2025-03-20T16:44:43Z) - CMamba: Learned Image Compression with State Space Models [31.10785880342252]
We propose a hybrid Convolution and State Space Models (SSMs) based image compression framework to achieve superior rate-distortion performance.
Specifically, CMamba introduces two key components: a Content-Adaptive SSM (CA-SSM) module and a Context-Aware Entropy (CAE) module.
Experimental results demonstrate that CMamba achieves superior rate-distortion performance.
arXiv Detail & Related papers (2025-02-07T15:07:04Z) - BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution [14.082598088990352]
We propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent spatial and temporal characteristics of video.
Our approach achieves state-of-the-art in various metrics, including PSNR and SSIM, showing enhanced spatial details and natural temporal consistency.
arXiv Detail & Related papers (2025-01-19T13:29:41Z) - Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential to effectively accelerate advanced diffusion models (DMs).
Existing binarization methods result in significant performance degradation.
We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z) - TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - Channel-wise Feature Decorrelation for Enhanced Learned Image Compression [16.638869231028437]
The emerging Learned Compression (LC) replaces the traditional modules with Deep Neural Networks (DNN), which are trained end-to-end for rate-distortion performance.
This paper proposes to improve compression by fully exploiting the existing DNN capacity.
Three strategies are proposed and evaluated, which optimize (1) the transformation network, (2) the context model, and (3) both networks.
arXiv Detail & Related papers (2024-03-16T14:30:25Z) - Dual Aggregation Transformer for Image Super-Resolution [92.41781921611646]
We propose a novel Transformer model, Dual Aggregation Transformer, for image SR.
Our DAT aggregates features across spatial and channel dimensions, in an inter-block and intra-block dual manner.
Our experiments show that our DAT surpasses current methods.
arXiv Detail & Related papers (2023-08-07T07:39:39Z) - LLIC: Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression [27.02281402358164]
We propose Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression.
We introduce a few large-kernel depth-wise convolutions to reduce redundancy further while maintaining modest complexity.
Our LLIC models achieve state-of-the-art performance and better trade-offs between performance and complexity.
arXiv Detail & Related papers (2023-04-19T11:19:10Z) - Learned Image Compression with Mixed Transformer-CNN Architectures [21.53261818914534]
We propose an efficient parallel Transformer-CNN Mixture (TCM) block with a controllable complexity.
Inspired by the recent progress of entropy estimation models and attention modules, we propose a channel-wise entropy model with parameter-efficient swin-transformer-based attention.
Experimental results demonstrate that our proposed method achieves state-of-the-art rate-distortion performance.
arXiv Detail & Related papers (2023-03-27T08:19:01Z) - Reference-based Image and Video Super-Resolution via C2-Matching [100.0808130445653]
We propose C2-Matching, which performs explicit, robust matching across transformation and resolution.
C2-Matching significantly outperforms the state of the art on the standard CUFED5 benchmark.
We also extend C2-Matching to the Reference-based Video Super-Resolution task, where an image taken in a similar scene serves as the HR reference image.
arXiv Detail & Related papers (2022-12-19T16:15:02Z) - Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes [0.0]
We propose to use the CTC-Prefix-Score during S2S decoding.
During beam search, paths that are invalid according to the CTC confidence matrix are penalised.
We evaluate this setup on three HTR data sets: IAM, Rimes, and StAZH.
arXiv Detail & Related papers (2021-10-12T11:40:05Z) - A Generic Network Compression Framework for Sequential Recommender Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing user's dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed as CpRec, where two generic model shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve compression rates of up to 4-8 times on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)