Quantised Global Autoencoder: A Holistic Approach to Representing Visual Data
- URL: http://arxiv.org/abs/2407.11913v2
- Date: Mon, 5 Aug 2024 17:50:03 GMT
- Title: Quantised Global Autoencoder: A Holistic Approach to Representing Visual Data
- Authors: Tim Elsner, Paula Usinger, Victor Czech, Gregor Kobsik, Yanjiang He, Isaak Lim, Leif Kobbelt
- Abstract summary: In quantised autoencoders, images are usually split into local patches, each encoded by one token.
Our method is inspired by spectral decompositions which transform an input signal into a superposition of global frequencies.
- Score: 7.152103069753289
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In quantised autoencoders, images are usually split into local patches, each encoded by one token. This representation is redundant in the sense that the same number of tokens is spent per region, regardless of the visual information content in that region. Adaptive discretisation schemes like quadtrees are applied to allocate tokens for patches with varying sizes, but this just varies the region of influence for a token which nevertheless remains a local descriptor. Modern architectures add an attention mechanism to the autoencoder which infuses some degree of global information into the local tokens. Despite the global context, tokens are still associated with a local image region. In contrast, our method is inspired by spectral decompositions which transform an input signal into a superposition of global frequencies. Taking the data-driven perspective, we learn custom basis functions corresponding to the codebook entries in our VQ-VAE setup. Furthermore, a decoder combines these basis functions in a non-linear fashion, going beyond the simple linear superposition of spectral decompositions. We can achieve this global description with an efficient transpose operation between features and channels and demonstrate our performance on compression.
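The contrast between local and global tokens described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a feature map stored as a positions-by-channels matrix, a plain nearest-neighbour codebook lookup, and hypothetical helper names (`transpose`, `nearest_code`, `quantise_local`, `quantise_global`). It only shows how a transpose could turn each token from a per-position descriptor into a descriptor that spans every position.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantise_local(features, codebook):
    """One token per position: each row (all channels at one position) is quantised."""
    return [nearest_code(row, codebook) for row in features]


def quantise_global(features, codebook):
    """One token per channel: after the transpose, each row spans all positions."""
    return [nearest_code(row, codebook) for row in transpose(features)]


# Toy feature map (assumed for illustration): 4 positions x 2 channels.
features = [[0.1, 0.9], [0.0, 1.1], [0.2, 1.0], [-0.1, 0.8]]

# Local codebook entries have one value per channel (length 2).
local_codebook = [[0.0, 1.0], [1.0, 0.0]]
# Global codebook entries have one value per position (length 4).
global_codebook = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]

local_tokens = quantise_local(features, local_codebook)     # 4 tokens, one per position
global_tokens = quantise_global(features, global_codebook)  # 2 tokens, one per channel
```

In this toy setup the number of global tokens equals the number of channels rather than the number of image positions, which is one way to read the abstract's claim that tokens need not be tied to a local region.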
Related papers
- Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks [59.12788703213031]
We present Omni-RGPT, a large language model designed to facilitate region-level comprehension for both images and videos.
We introduce Token Mark, a set of tokens highlighting the target regions within the visual-temporal feature space.
We also introduce a large-scale region-level video instruction dataset (VID-300k).
arXiv Detail & Related papers (2025-01-14T18:58:04Z)
- Tokenphormer: Structure-aware Multi-token Graph Transformer for Node Classification [9.967313792318606]
We propose the Structure-aware Multi-token Graph Transformer (Tokenphormer).
It generates multiple tokens to capture local and structural information and explore global information at different levels of granularity.
Experimental results demonstrate that the proposed Tokenphormer achieves state-of-the-art performance on node classification tasks.
arXiv Detail & Related papers (2024-12-19T10:44:18Z)
- LGFCTR: Local and Global Feature Convolutional Transformer for Image Matching [8.503217766507584]
A novel convolutional transformer is proposed to capture both local contexts and global structures.
A universal FPN-like framework captures global structures in self-encoder as well as cross-decoder by transformers.
A novel regression-based sub-pixel refinement module exploits the whole fine-grained window features for fine-level positional deviation regression.
arXiv Detail & Related papers (2023-11-29T12:06:19Z)
- Locality-Aware Generalizable Implicit Neural Representation [54.93702310461174]
Generalizable implicit neural representation (INR) enables a single continuous function to represent multiple data instances.
We propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder.
Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks.
arXiv Detail & Related papers (2023-10-09T11:26:58Z)
- Local2Global: A distributed approach for scaling representation learning on graphs [10.254620252788776]
We propose a decentralised "local2global" approach to graph representation learning that one can a priori use to scale any embedding technique.
We show that our approach achieves a good trade-off between scale and accuracy on edge reconstruction and semi-supervised classification.
We also consider the downstream task of anomaly detection and show how one can use local2global to highlight anomalies in cybersecurity networks.
arXiv Detail & Related papers (2022-01-12T23:00:22Z)
- Locally Shifted Attention With Early Global Integration [93.5766619842226]
We propose an approach that allows for coarse global interactions and fine-grained local interactions already at early layers of a vision transformer.
Our method is shown to be superior to both convolutional and transformer-based methods for image classification on CIFAR10, CIFAR100, and ImageNet.
arXiv Detail & Related papers (2021-12-09T18:12:24Z)
- A Volumetric Transformer for Accurate 3D Tumor Segmentation [25.961484035609672]
This paper presents a Transformer architecture for medical image segmentation.
The Transformer has a U-shaped volumetric encoder-decoder design that processes the input voxels in their entirety.
We show that our model transfers better representations across datasets and is robust against data corruptions.
arXiv Detail & Related papers (2021-11-26T02:49:51Z)
- Global and Local Alignment Networks for Unpaired Image-to-Image Translation [170.08142745705575]
The goal of unpaired image-to-image translation is to produce an output image reflecting the target domain's style.
Due to the lack of attention to the content change in existing methods, semantic information from source images suffers from degradation during translation.
We introduce a novel approach, Global and Local Alignment Networks (GLA-Net).
Our method effectively generates sharper and more realistic images than existing approaches.
arXiv Detail & Related papers (2021-11-19T18:01:54Z)
- Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)
- Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with tokens corresponding to image patches.
arXiv Detail & Related papers (2021-05-23T12:44:27Z)
- SceneEncoder: Scene-Aware Semantic Segmentation of Point Clouds with A Learnable Scene Descriptor [51.298760338410624]
We propose a SceneEncoder module to impose a scene-aware guidance to enhance the effect of global information.
The module predicts a scene descriptor, which learns to represent the categories of objects existing in the scene.
We also design a region similarity loss to propagate distinguishing features to their own neighboring points with the same label.
arXiv Detail & Related papers (2020-01-24T16:53:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.