Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision
- URL: http://arxiv.org/abs/2501.04579v1
- Date: Wed, 08 Jan 2025 15:48:30 GMT
- Title: Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision
- Authors: Kangsheng Yin, Quan Liu, Xuelin Shen, Yulin He, Wenhan Yang, Shiqi Wang
- Abstract summary: This paper introduces supervision from multimodal pre-training models and incorporates adaptive multi-objective optimization to support both human visual perception and machine vision simultaneously with a single bitstream.
The proposed Unified and Generalized Image Coding for Machine (UG-ICM) is capable of achieving remarkable improvements in various unseen machine analytics tasks.
- Score: 44.5080084219247
- Abstract: Image compression models have long struggled with adaptability and generalization, as the decoded bitstream typically serves only human or machine needs and fails to preserve information for unseen visual tasks. This paper therefore introduces supervision obtained from multimodal pre-training models and incorporates adaptive multi-objective optimization to support both human visual perception and machine vision simultaneously with a single bitstream, denoted as Unified and Generalized Image Coding for Machine (UG-ICM). Specifically, to remove the dependence of compression models on downstream task supervision, we introduce Contrastive Language-Image Pre-training (CLIP) models into the training constraint for improved generalization. Global-to-instance-wise CLIP supervision is applied to obtain hierarchical semantics that make the model more generalizable across tasks relying on information of different granularities. Furthermore, to support both human and machine vision with a single unified bitstream, we incorporate a conditional decoding strategy that takes the human or machine preference as a condition, enabling the bitstream to be decoded into different versions for the corresponding preference. As such, the proposed UG-ICM is trained in a fully self-supervised manner, i.e., without awareness of any specific downstream models or tasks. Extensive experiments show that the proposed UG-ICM achieves remarkable improvements in various unseen machine analytics tasks while simultaneously providing perceptually satisfying images.
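Reading the abstract at face value, two mechanisms carry the method: a CLIP-derived training constraint that stands in for downstream task supervision, and a decoder conditioned on a human-or-machine preference. The sketch below is a minimal illustration under stated assumptions rather than UG-ICM's actual implementation: it uses OpenAI's `clip` package for a global semantic-consistency loss, and the `PreferenceConditionedDecoder` (its embedding-based conditioning and all dimensions) is hypothetical.
```python
# Minimal sketch, not UG-ICM's implementation: a global CLIP-supervision
# loss plus a decoder conditioned on a human/machine preference flag.
# Assumes OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()  # frozen; no downstream task labels are involved

def clip_semantic_loss(original, decoded):
    """Penalize semantic drift in CLIP space: 1 - cos(z_orig, z_dec).
    Both tensors must already be CLIP-preprocessed (224x224, normalized)."""
    with torch.no_grad():
        z_orig = clip_model.encode_image(original)
    z_dec = clip_model.encode_image(decoded)  # gradients flow into the codec
    return 1.0 - F.cosine_similarity(z_orig, z_dec, dim=-1).mean()

class PreferenceConditionedDecoder(nn.Module):
    """Hypothetical decoder: a learned embedding of the preference
    (0 = human perception, 1 = machine analytics) modulates the shared
    latent, so one bitstream decodes into two different versions."""
    def __init__(self, latent_dim=192):
        super().__init__()
        self.pref_embed = nn.Embedding(2, latent_dim)
        self.deconv = nn.ConvTranspose2d(latent_dim, 3, 16, stride=16)

    def forward(self, y, preference):
        scale = self.pref_embed(preference)[:, :, None, None]
        return torch.sigmoid(self.deconv(y * (1.0 + scale)))
```
The abstract's global-to-instance-wise hierarchy would, presumably, apply the same CLIP loss additionally to cropped object regions so that instance-level semantics also survive compression.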
Related papers
- Semantics Disentanglement and Composition for Versatile Codec toward both Human-eye Perception and Machine Vision Task [47.7670923159071]
This study introduces an innovative semantics DISentanglement and COmposition VERsatile codec (DISCOVER) to simultaneously enhance human-eye perception and machine vision tasks.
The approach derives a set of labels per task through multimodal large models; grounding models are then applied for precise localization, enabling a comprehensive understanding and disentanglement of image components at the encoder side.
At the decoding stage, a comprehensive reconstruction of the image is achieved by leveraging these encoded components alongside priors from generative models, thereby optimizing performance for both human visual perception and machine-based analytical tasks.
arXiv Detail & Related papers (2024-12-24T04:32:36Z)
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- All-in-One Image Coding for Joint Human-Machine Vision with Multi-Path Aggregation [28.62276713652864]
We propose Multi-Path Aggregation (MPA) integrated into existing coding models for joint human-machine vision.
MPA employs a predictor to allocate latent features among task-specific paths.
MPA achieves performance comparable to state-of-the-art methods in both task-specific and multi-objective optimization.
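For intuition, here is one hedged reading of that summary: a lightweight predictor emits per-position allocation weights that split latent features between a human-oriented and a machine-oriented path. The two-path gating, layer choices, and `latent_dim` are illustrative assumptions, not the MPA architecture.
```python
# Illustrative two-path aggregation (not the MPA paper's design): a small
# predictor soft-assigns each latent position to task-specific paths.
import torch
import torch.nn as nn

class TwoPathAggregation(nn.Module):
    def __init__(self, latent_dim=192):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.Conv2d(latent_dim, latent_dim // 4, 1), nn.ReLU(),
            nn.Conv2d(latent_dim // 4, 2, 1))            # per-position path logits
        self.human_path = nn.Conv2d(latent_dim, latent_dim, 3, padding=1)
        self.machine_path = nn.Conv2d(latent_dim, latent_dim, 3, padding=1)

    def forward(self, y):
        w = torch.softmax(self.predictor(y), dim=1)      # allocation weights
        return w[:, :1] * self.human_path(y) + w[:, 1:] * self.machine_path(y)
```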
arXiv Detail & Related papers (2024-09-29T11:14:21Z)
- High Efficiency Image Compression for Large Visual-Language Models [14.484831372497437]
Large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks.
We propose a variable image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance.
arXiv Detail & Related papers (2024-07-24T07:37:12Z)
- Scalable Face Image Coding via StyleGAN Prior: Towards Compression for Human-Machine Collaborative Vision [39.50768518548343]
We investigate how hierarchical representations derived from the advanced generative prior facilitate constructing an efficient scalable coding paradigm for human-machine collaborative vision.
Our key insight is that by exploiting the StyleGAN prior, we can learn three-layered representations encoding hierarchical semantics, which are elaborately designed into the basic, middle, and enhanced layers.
Based on the multi-task scalable rate-distortion objective, the proposed scheme is jointly optimized to achieve optimal machine analysis performance, human perception experience, and compression ratio.
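As a rough illustration of that basic/middle/enhanced layering, the sketch below uses a generic residual scalable-decoding pattern rather than the paper's StyleGAN-prior design: the basic layer alone suffices for coarse machine analysis, and each additional layer refines the reconstruction toward human-perception quality.
```python
# Generic scalable-layer sketch (not the StyleGAN-prior scheme itself):
# decode progressively richer reconstructions as more layers arrive.
import torch
import torch.nn as nn

class ThreeLayerScalableDecoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.basic = nn.ConvTranspose2d(dim, 3, 8, stride=8)     # coarse semantics
        self.middle = nn.ConvTranspose2d(dim, 3, 8, stride=8)    # structural detail
        self.enhanced = nn.ConvTranspose2d(dim, 3, 8, stride=8)  # perceptual texture

    def forward(self, z_basic, z_middle=None, z_enhanced=None):
        x = self.basic(z_basic)                # enough for machine analysis
        if z_middle is not None:
            x = x + self.middle(z_middle)      # refine structure
        if z_enhanced is not None:
            x = x + self.enhanced(z_enhanced)  # full human-perception quality
        return torch.sigmoid(x)
```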
arXiv Detail & Related papers (2023-12-25T05:57:23Z)
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performances are sub-optimal or even lag far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
- Image-specific Convolutional Kernel Modulation for Single Image Super-resolution [85.09413241502209]
To address this issue, we propose novel image-specific convolutional kernel modulation (IKM).
We exploit the global contextual information of the image or feature to generate an attention weight that adaptively modulates the convolutional kernels.
Experiments on single image super-resolution show that the proposed methods achieve superior performance over state-of-the-art methods.
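A minimal sketch of what such modulation could look like, borrowing the common per-sample modulated-convolution trick implemented via a grouped convolution; all names and shapes here are assumptions rather than the IKM paper's exact design.
```python
# Illustrative image-specific kernel modulation (shapes/names assumed):
# pool global context, predict per-channel weights, rescale the kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv(nn.Module):
    def __init__(self, channels=64, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.02)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # global context -> [B, C]
            nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        a = self.attn(x)                                 # per-image attention weights
        # Modulate output-channel kernels per image, then run one grouped conv.
        wmod = self.weight[None] * a[:, :, None, None, None]  # [B, C, C, k, k]
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       wmod.reshape(b * c, c, *self.weight.shape[-2:]),
                       padding=self.weight.shape[-1] // 2, groups=b)
        return out.reshape(b, c, h, w)
```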
arXiv Detail & Related papers (2021-11-16T11:05:10Z)
- Video Coding for Machine: Compact Visual Representation Compression for Intelligent Collaborative Analytics [101.35754364753409]
Video Coding for Machines (VCM) is committed to bridging the largely separate research tracks of video/image compression and feature compression.
This paper summarizes the VCM methodology and philosophy based on existing academic and industrial efforts.
arXiv Detail & Related papers (2021-10-18T12:42:13Z)