High Efficiency Image Compression for Large Visual-Language Models
- URL: http://arxiv.org/abs/2407.17060v1
- Date: Wed, 24 Jul 2024 07:37:12 GMT
- Title: High Efficiency Image Compression for Large Visual-Language Models
- Authors: Binzhe Li, Shurun Wang, Shiqi Wang, Yan Ye
- Abstract summary: Large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks.
We propose a variable bitrate image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance.
- Score: 14.484831372497437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we are the first to propose a variable bitrate image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, which is designed based on the representation and discrimination capability with token-level distortion and rank. The pre-editing module and the variable bitrate end-to-end image codec are jointly trained by losses based on the semantic tokens of the large model, which introduces enhanced generalization capability across various data and tasks. Experimental results demonstrate that the proposed framework efficiently achieves much better rate-accuracy performance than the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments with multi-modal tasks have revealed the robustness and generalization capability of the proposed framework.
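The abstract describes a joint objective combining a bitrate term with losses computed on the LVLM's semantic tokens (token-level distortion and rank), but does not spell out the formulation. The Python sketch below is only an illustration under assumptions: the distortion term is taken as an MSE between token embeddings and the rank term as a soft consistency between token-query similarity orderings; all function names and loss weights are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of the rate-accuracy objective described in the abstract.
# The token-distortion and rank formulations below are assumptions, not the
# authors' actual losses; lambda_d and lambda_r are placeholder weights.
import torch
import torch.nn.functional as F

def token_distortion(tokens_ref, tokens_rec):
    """Distance between LVLM semantic tokens of the original image and of the
    compressed reconstruction (assumed here to be a plain MSE)."""
    return F.mse_loss(tokens_rec, tokens_ref)

def token_rank_loss(tokens_ref, tokens_rec, query):
    """Penalize changes in the ordering of token-query similarities so the
    LVLM's discrimination over tokens is preserved after compression."""
    sim_ref = tokens_ref @ query   # (num_tokens,) similarity scores
    sim_rec = tokens_rec @ query
    return F.kl_div(F.log_softmax(sim_rec, dim=-1),
                    F.softmax(sim_ref, dim=-1), reduction="sum")

def joint_loss(rate_bits, tokens_ref, tokens_rec, query,
               lambda_d=1.0, lambda_r=0.1):
    """Bitrate term plus the semantic-token terms that jointly drive the
    pre-editing module and the variable bitrate codec."""
    return (rate_bits
            + lambda_d * token_distortion(tokens_ref, tokens_rec)
            + lambda_r * token_rank_loss(tokens_ref, tokens_rec, query))
```

In such a setup, `tokens_ref` and `tokens_rec` would come from the LVLM's visual tokenizer applied to the original and reconstructed images, so the codec is optimized for the model's perception rather than pixel fidelity.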
Related papers
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - Improved Baselines for Data-efficient Perceptual Augmentation of LLMs [66.05826802808177]
In computer vision, large language models (LLMs) can be used to prime vision-language tasks such as image captioning and visual question answering.
We present an experimental evaluation of different interfacing mechanisms, across multiple tasks.
We identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a 4x reduction in training time.
arXiv Detail & Related papers (2024-03-20T10:57:17Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - A Thorough Examination of Decoding Methods in the Era of LLMs [72.65956436513241]
Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers.
This paper provides a comprehensive and multifaceted analysis of various decoding methods within the context of large language models.
Our findings reveal that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization.
arXiv Detail & Related papers (2024-02-10T11:14:53Z) - M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition [39.92547393649842]
We introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges.
We demonstrate exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
arXiv Detail & Related papers (2024-01-22T02:03:31Z) - u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z) - MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition [41.78245303513613]
We introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations.
In addition to these token-level designs, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes.
arXiv Detail & Related papers (2023-08-03T04:17:25Z) - SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models [35.5601603013045]
We propose SmartTrim, an adaptive acceleration framework for Vision-Language Models (VLMs).
We integrate lightweight modules into the original backbone to identify and prune redundant token representations and attention heads within each layer.
We devise a self-distillation strategy to enhance the consistency between the predictions of the pruned model and its full-capacity counterpart.
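In spirit, such a self-distillation strategy is a consistency loss that pulls the pruned model's predictions toward those of the full-capacity model. As a rough illustration (a generic temperature-scaled KL term, not SmartTrim's exact recipe):

```python
# Generic sketch of a self-distillation consistency term between a pruned
# model and its full-capacity counterpart; assumed here for illustration only.
import torch.nn.functional as F

def consistency_loss(logits_pruned, logits_full, temperature=2.0):
    """KL divergence between pruned- and full-model output distributions."""
    p_full = F.softmax(logits_full.detach() / temperature, dim=-1)
    log_p_pruned = F.log_softmax(logits_pruned / temperature, dim=-1)
    return F.kl_div(log_p_pruned, p_full, reduction="batchmean") * temperature ** 2
```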
arXiv Detail & Related papers (2023-05-24T11:18:00Z) - Towards Modality Transferable Visual Information Representation with Optimal Model Compression [67.89885998586995]
We propose a new scheme for visual signal representation that leverages the philosophy of transferable modality.
The proposed framework is implemented on the state-of-the-art video coding standard.
arXiv Detail & Related papers (2020-08-13T01:52:40Z) - A Flexible Framework for Designing Trainable Priors with Adaptive Smoothing and Game Encoding [57.1077544780653]
We introduce a general framework for designing and training neural network layers whose forward passes can be interpreted as solving non-smooth convex optimization problems.
We focus on convex games, solved by local agents represented by the nodes of a graph and interacting through regularization functions.
This approach is appealing for solving imaging problems, as it allows the use of classical image priors within deep models that are trainable end to end.
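A concrete instance of a layer whose forward pass solves a non-smooth convex problem is an unrolled proximal-gradient solver for a sparse-coding prior. The sketch below is an assumption-laden illustration with a learnable dictionary `D`; it does not reproduce the paper's specific game formulation or regularizers.

```python
# Sketch of a layer whose forward pass unrolls proximal gradient descent on
# min_z 0.5 * ||x - z D^T||^2 + lam * ||z||_1 (a classical sparse-coding prior).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProxGradLayer(nn.Module):
    def __init__(self, dim_in, dim_code, n_iter=10, lam=0.1):
        super().__init__()
        self.D = nn.Parameter(torch.randn(dim_in, dim_code) * 0.1)  # learnable dictionary
        self.lam = lam
        self.n_iter = n_iter

    def forward(self, x):
        # Conservative step size from an upper bound on the Lipschitz constant of D^T D.
        step = 1.0 / (float(self.D.norm()) ** 2 + 1e-6)
        z = torch.zeros(x.shape[0], self.D.shape[1], device=x.device)
        for _ in range(self.n_iter):
            grad = (z @ self.D.t() - x) @ self.D                 # gradient of the smooth term
            z = F.softshrink(z - step * grad, self.lam * step)   # proximal operator of the L1 term
        return z
```

Because every operation in the loop is differentiable, the dictionary can be trained end to end through the unrolled solver, which is the property the paper's framework builds on.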
arXiv Detail & Related papers (2020-06-26T08:34:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.