Towards Modality Transferable Visual Information Representation with
Optimal Model Compression
- URL: http://arxiv.org/abs/2008.05642v1
- Date: Thu, 13 Aug 2020 01:52:40 GMT
- Title: Towards Modality Transferable Visual Information Representation with
Optimal Model Compression
- Authors: Rongqun Lin, Linwei Zhu, Shiqi Wang and Sam Kwong
- Abstract summary: We propose a new scheme for visual signal representation that leverages the philosophy of transferable modality.
The proposed framework is implemented on the state-of-the-art video coding standard.
- Score: 67.89885998586995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compactly representing visual signals is of fundamental importance in
various image/video-centered applications. Although numerous approaches have been
developed to improve image and video coding performance by removing redundancies
within visual signals, much less work has been dedicated to transforming visual
signals into another well-established modality with better representation
capability. In this paper, we propose a new scheme for visual signal
representation that leverages the philosophy of transferable modality. In
particular, a deep learning model, which characterizes and absorbs the statistics
of the input scene through online training, can be efficiently represented in the
sense of rate-utility optimization and serve as the enhancement layer in the
bitstream. As such, the overall performance can be further guaranteed by
optimizing the newly incorporated modality. The proposed framework is implemented
on the state-of-the-art video coding standard (i.e., versatile video coding), and
significantly better representation capability has been observed in extensive
evaluations.
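To make the rate-utility idea concrete, below is a minimal sketch of how an online-trained enhancement model could be compressed by sweeping a Lagrangian cost. The distortion model, candidate bit depths, and multiplier are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): choose how coarsely to
# quantize an online-trained enhancement model so that the Lagrangian
# cost J = D + lambda * R is minimized, where R is the bits spent on
# signalling the model and D is the residual distortion it leaves.
import numpy as np

def model_rate(num_weights: int, bits_per_weight: int) -> int:
    # Rate of the enhancement layer: bits needed to signal the model.
    return num_weights * bits_per_weight

def model_distortion(weights: np.ndarray, bits_per_weight: int) -> float:
    # Assumed distortion model: uniform quantization noise shrinks the
    # quality gain the model can deliver (high-rate approximation).
    step = (weights.max() - weights.min()) / (2 ** bits_per_weight)
    return step ** 2 / 12.0

def choose_operating_point(weights: np.ndarray, lam: float = 1e-9):
    # Sweep candidate bit depths; keep the rate-utility optimal one.
    best = None
    for bits in (2, 4, 6, 8, 10, 12):
        rate = model_rate(weights.size, bits)
        dist = model_distortion(weights, bits)
        cost = dist + lam * rate
        if best is None or cost < best[-1]:
            best = (bits, rate, dist, cost)
    return best

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, 10_000)  # stand-in for online-trained weights
bits, rate, dist, cost = choose_operating_point(weights)
print(f"bit depth {bits}: {rate} bits, distortion {dist:.2e}")
```

In the actual scheme the distortion term would be measured by applying the quantized model to the base-layer reconstruction; the selection logic, however, is the same Lagrangian sweep.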
Related papers
- High Efficiency Image Compression for Large Visual-Language Models [14.484831372497437]
Large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks.
We propose a variable image compression framework, consisting of a pre-editing module and an end-to-end codec, to achieve promising rate-accuracy performance.
arXiv Detail & Related papers (2024-07-24T07:37:12Z)
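A minimal sketch of such a two-stage pipeline, assuming a toy saliency-driven pre-editor and a uniform-quantization stand-in for the learned codec (neither is the paper's actual module):

```python
# Illustrative sketch of a pre-editing + end-to-end codec pipeline
# optimized for rate-accuracy rather than rate-distortion. All modules
# here are toy stand-ins, not the paper's networks.
import numpy as np

def pre_edit(image: np.ndarray, saliency: np.ndarray) -> np.ndarray:
    # Pre-editing: suppress detail where the downstream LVLM is assumed
    # not to look, so the codec spends fewer bits there.
    blurred = image * 0.5 + image.mean() * 0.5
    return np.where(saliency > 0.5, image, blurred)

def codec(image: np.ndarray, step: float):
    # Toy "end-to-end codec": uniform quantization as a placeholder for
    # a learned analysis/synthesis transform.
    symbols = np.round(image / step)
    rate = np.count_nonzero(symbols)  # crude rate proxy
    return symbols * step, rate

img = np.random.default_rng(1).random((64, 64))
sal = np.zeros((64, 64))
sal[16:48, 16:48] = 1.0  # assumed region-of-interest mask
recon, rate = codec(pre_edit(img, sal), step=0.05)
print(f"rate proxy: {rate} nonzero symbols")
```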
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
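A minimal sketch of the two-pass region-of-interest loop this suggests, where `locate_roi` and `lvlm` are hypothetical stand-ins for the model components:

```python
# Illustrative two-pass "focus on a region of interest" loop in the
# spirit of Chain-of-Spot; `locate_roi` and `lvlm` are hypothetical
# stand-ins, not the paper's actual models.
import numpy as np

def locate_roi(image: np.ndarray, question: str) -> tuple:
    # First pass: decide where to look (here: a fixed center crop).
    h, w = image.shape[:2]
    return h // 4, h * 3 // 4, w // 4, w * 3 // 4

def lvlm(image: np.ndarray, question: str) -> str:
    # Stand-in for a large vision-language model call.
    return f"answer conditioned on a {image.shape[0]}x{image.shape[1]} view"

def chain_of_spot(image: np.ndarray, question: str) -> str:
    top, bottom, left, right = locate_roi(image, question)
    crop = image[top:bottom, left:right]  # full-resolution crop, no resizing
    # Second pass: reason over the zoomed region of interest.
    return lvlm(crop, question)

print(chain_of_spot(np.zeros((336, 336, 3)), "what is on the table?"))
```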
- VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z)
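A minimal sketch of the latent-domain idea, assuming toy analysis/synthesis transforms and a toy task head:

```python
# Illustrative sketch of the "use the coded representation directly"
# idea: a task head runs on the compact latent, skipping pixel
# reconstruction entirely. Shapes and heads are assumptions.
import numpy as np

rng = np.random.default_rng(2)

def encode(frame: np.ndarray) -> np.ndarray:
    # Stand-in analysis transform: 8x spatial downsampling to a latent.
    h, w = frame.shape
    return frame.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))

def decode(latent: np.ndarray) -> np.ndarray:
    # Pixel path, only needed for human viewing.
    return np.kron(latent, np.ones((8, 8)))

def analyze(latent: np.ndarray) -> int:
    # Machine path: a toy classifier on the latent, no decoding.
    return int(latent.mean() > 0.5)

frame = rng.random((256, 256))
latent = encode(frame)
label = analyze(latent)   # machine vision: latent in, label out
pixels = decode(latent)   # human vision: decode only when needed
print(label, pixels.shape)
```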
- Interactive Face Video Coding: A Generative Compression Framework [18.26476468644723]
We propose a novel framework for Interactive Face Video Coding (IFVC), which allows humans to interact with the intrinsic visual representations instead of the signals.
The proposed solution enjoys several distinct advantages, including ultra-compact representation, low-delay interaction, and vivid expression and head-pose animation.
arXiv Detail & Related papers (2023-02-20T11:24:23Z)
- Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image.
The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z)
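A minimal sketch of a continuous online mode decision, assuming a toy tunable transform in place of the paper's neural one; the selected parameter plays the role of the signalled "neural syntax":

```python
# Illustrative sketch of a continuous online mode decision: per image,
# search a continuous transform parameter and signal the best one as
# side information. The transform itself is a toy stand-in.
import numpy as np

def transform(image: np.ndarray, theta: float) -> np.ndarray:
    # Toy data-dependent transform: a tunable nonlinearity.
    return np.sign(image) * np.abs(image) ** theta

def inverse(coeff: np.ndarray, theta: float) -> np.ndarray:
    return np.sign(coeff) * np.abs(coeff) ** (1.0 / theta)

def code_cost(image: np.ndarray, theta: float, step: float = 0.02) -> float:
    # Rate-distortion cost of coding this image with this parameter.
    coeff = np.round(transform(image, theta) / step) * step
    recon = inverse(coeff, theta)
    mse = float(np.mean((image - recon) ** 2))
    rate = np.count_nonzero(coeff)  # crude rate proxy
    return mse + 1e-6 * rate

img = np.random.default_rng(3).random((64, 64))
# Continuous online mode decision: pick theta per image, then signal it.
thetas = np.linspace(0.5, 1.5, 21)
best_theta = min(thetas, key=lambda t: code_cost(img, float(t)))
print(f"signalled transform parameter: {best_theta:.2f}")
```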
- Adaptive Intermediate Representations for Video Understanding [50.64187463941215]
We introduce a new way to leverage semantic segmentation as an intermediate representation for video understanding.
We propose a general framework which learns the intermediate representations (optical flow and semantic segmentation) jointly with the final video understanding task.
We obtain more powerful visual representations for videos which lead to performance gains over the state-of-the-art.
arXiv Detail & Related papers (2021-04-14T21:37:23Z)
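A minimal sketch of the joint training objective, with assumed loss weights and toy heads (not the paper's exact setup):

```python
# Illustrative sketch of learning intermediate representations (optical
# flow, segmentation) jointly with the end task via a weighted loss.
import numpy as np

rng = np.random.default_rng(4)

def task_loss(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

# Toy predictions from three heads assumed to share a backbone.
video_logits, flow_pred, seg_pred = rng.random(10), rng.random(100), rng.random(100)
video_y, flow_y, seg_y = rng.random(10), rng.random(100), rng.random(100)

alpha, beta = 0.1, 0.1  # assumed auxiliary-loss weights
loss = (task_loss(video_logits, video_y)
        + alpha * task_loss(flow_pred, flow_y)  # intermediate: optical flow
        + beta * task_loss(seg_pred, seg_y))    # intermediate: segmentation
print(f"joint loss: {loss:.3f}")
```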
- Adaptive Compact Attention For Few-shot Video-to-video Translation [13.535988102579918]
We introduce a novel adaptive compact attention mechanism to efficiently extract contextual features jointly from multiple reference images.
Our core idea is to extract compact basis sets from all the reference images as higher-level representations.
We extensively evaluate our method on a large-scale talking-head video dataset and a human dancing dataset.
arXiv Detail & Related papers (2020-11-30T11:19:12Z)
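A minimal sketch of the compact-basis idea, using k-means-style pooling as an assumed stand-in for the paper's learned basis extraction:

```python
# Illustrative sketch of "compact attention": pool features from all
# reference images into a small basis set, then attend over the bases
# instead of every spatial location.
import numpy as np

rng = np.random.default_rng(5)

def compact_basis(features: np.ndarray, k: int, iters: int = 10) -> np.ndarray:
    # features: (N, C) stacked from all reference images.
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((features[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members):
                centers[j] = members.mean(0)
    return centers  # (k, C) higher-level representation

def attend(query: np.ndarray, basis: np.ndarray) -> np.ndarray:
    # Standard dot-product attention, but over k bases, not N locations.
    logits = query @ basis.T / np.sqrt(basis.shape[1])
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ basis

refs = rng.random((4 * 1024, 64))  # features from 4 reference images
basis = compact_basis(refs, k=16)  # 16 bases replace 4096 keys
out = attend(rng.random((32, 64)), basis)
print(out.shape)
```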
- An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal [99.49099501559652]
Video Coding for Machines (VCM) aims to bridge the gap between visual feature compression and classical video coding.
We employ a conditional deep generation network to reconstruct video frames under the guidance of learned motion patterns.
By learning to extract sparse motion patterns via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames.
arXiv Detail & Related papers (2020-01-09T14:18:18Z)
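A minimal sketch of the key-frame-plus-sparse-motion scheme, where the motion extractor and the generator are toy stand-ins for the paper's predictive model and conditional generation network:

```python
# Illustrative sketch of the VCM idea: transmit a key frame plus sparse
# motion (a few keypoint displacements per frame) and let a generator
# synthesize the appearance of to-be-coded frames.
import numpy as np

rng = np.random.default_rng(6)

def extract_sparse_motion(frame: np.ndarray, k: int = 8) -> np.ndarray:
    # Stand-in predictive model: k keypoint displacements (dy, dx).
    return rng.normal(0.0, 1.5, (k, 2))

def generate(key_frame: np.ndarray, motion: np.ndarray) -> np.ndarray:
    # Toy conditional generation: shift the key frame by the mean motion.
    dy, dx = np.round(motion.mean(0)).astype(int)
    return np.roll(np.roll(key_frame, dy, axis=0), dx, axis=1)

key_frame = rng.random((128, 128))
bitstream = [extract_sparse_motion(key_frame) for _ in range(30)]
recon = [generate(key_frame, m) for m in bitstream]
print(f"signalled {bitstream[0].size} motion values per frame")
```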
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.