NARAIM: Native Aspect Ratio Autoregressive Image Models
- URL: http://arxiv.org/abs/2410.10012v2
- Date: Wed, 04 Dec 2024 22:21:36 GMT
- Title: NARAIM: Native Aspect Ratio Autoregressive Image Models
- Authors: Daniel Gallo Fernández, Robert van der Klis, Răzvan-Andrei Matişan, Janusz Partyka, Efstratios Gavves, Samuele Papa, Phillip Lippe
- Abstract summary: We propose NARAIM, a vision model pre-trained with an autoregressive objective that uses images in their native aspect ratio. By maintaining the native aspect ratio, we preserve the original spatial context, thereby enhancing the model's ability to interpret visual information.
- Score: 26.26674614731835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While vision transformers are able to solve a wide variety of computer vision tasks, no pre-training method has yet demonstrated the same scaling laws as observed in language models. Autoregressive models show promising results, but are commonly trained on images that are cropped or transformed into square images, which distorts or destroys information present in the input. To overcome this limitation, we propose NARAIM, a vision model pre-trained with an autoregressive objective that uses images in their native aspect ratio. By maintaining the native aspect ratio, we preserve the original spatial context, thereby enhancing the model's ability to interpret visual information. In our experiments, we show that maintaining the aspect ratio improves performance on a downstream classification task.
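A minimal sketch of what such pre-training can look like, assuming a plain causal transformer over flattened patches and a pixel-regression next-patch loss; the patch size, padding scheme, loss, and omitted positional encoding are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 16  # assumed patch size

def patchify_native(img: torch.Tensor, patch: int = PATCH) -> torch.Tensor:
    """Split a (C, H, W) image into a sequence of flattened patches without
    resizing it to a square: H and W are only padded up to a multiple of the
    patch size, so the native aspect ratio is preserved."""
    c, h, w = img.shape
    img = F.pad(img, (0, (-w) % patch, 0, (-h) % patch))            # pad right/bottom only
    c, h, w = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, H/P, W/P, P, P)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

class TinyARModel(nn.Module):
    """Causal transformer that predicts the raw pixels of the next patch.
    (2-D positional information is omitted here for brevity.)"""
    def __init__(self, dim=256, depth=4, heads=4, patch=PATCH, channels=3):
        super().__init__()
        in_dim = channels * patch * patch
        self.embed = nn.Linear(in_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, in_dim)

    def forward(self, patches):                                      # (B, N, C*P*P)
        n = patches.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf"), device=patches.device), 1)
        return self.head(self.blocks(self.embed(patches), mask=causal))

def autoregressive_loss(model, patches):
    """Next-patch prediction: the token at position t reconstructs patch t+1."""
    return F.mse_loss(model(patches[:, :-1]), patches[:, 1:])
```

For a 3×480×640 input, `patchify_native` yields a 1,200-token sequence, whereas cropping or squashing the image to 480×480 would discard or distort a quarter of its horizontal extent.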
Related papers
- Image Reconstruction as a Tool for Feature Analysis [2.0249250133493195]
We propose a novel approach for interpreting vision features via image reconstruction. We show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space.
arXiv Detail & Related papers (2025-06-09T14:32:18Z)
- Sensitive Image Classification by Vision Transformers [1.9598097298813262]
Vision transformer models leverage a self-attention mechanism to capture global interactions among contextual local elements.
In our study, we conducted a comparative analysis between various popular vision transformer models and traditional pre-trained ResNet models.
The findings demonstrated that vision transformer networks surpassed the benchmark pre-trained models, showcasing their superior classification and detection capabilities.
arXiv Detail & Related papers (2024-12-21T02:34:24Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model, improving performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis [45.19847146506007]
Diffusion models (DMs) have recently gained attention with state-of-the-art performance in text-to-image synthesis.
This paper focuses on adapting text-to-image diffusion models to handle varying image sizes and aspect ratios while maintaining visual fidelity.
arXiv Detail & Related papers (2023-06-14T17:23:07Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
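A rough, hypothetical sketch of this kind of auxiliary supervision, in which pooled features from the vision model are pulled towards frozen text embeddings of the image's description; the projection head and cosine loss are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptionAlignment(nn.Module):
    """Auxiliary head that aligns a vision model's pooled features with
    pre-computed text embeddings of image descriptions (illustrative only)."""
    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, text_dim)

    def forward(self, image_features: torch.Tensor, text_embeddings: torch.Tensor):
        z_img = F.normalize(self.proj(image_features), dim=-1)  # (B, D)
        z_txt = F.normalize(text_embeddings.detach(), dim=-1)   # (B, D), frozen text encoder
        # cosine-distance term added on top of the usual task loss
        return 1.0 - (z_img * z_txt).sum(dim=-1).mean()
```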
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Exploring Stochastic Autoregressive Image Modeling for Visual Representation [24.582376834198403]
We propose a novel stochastic autoregressive image modeling method (named SAIM) built on two simple designs.
By introducing stochastic prediction and a parallel encoder-decoder, SAIM significantly improves the performance of autoregressive image modeling.
Our method achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data.
arXiv Detail & Related papers (2022-12-03T13:04:29Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
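A generic prompt-tuning sketch under the same idea of keeping the backbone frozen and learning only a few extra parameters; the prompt shape and insertion point are assumptions and differ from Pro-tuning's actual prompt blocks.

```python
import torch
import torch.nn as nn

class PromptedBackbone(nn.Module):
    """Prepend a handful of learnable prompt tokens to a frozen transformer
    backbone's patch tokens; only the prompts (plus a task head elsewhere) train."""
    def __init__(self, frozen_blocks: nn.Module, dim: int, num_prompts: int = 8):
        super().__init__()
        self.blocks = frozen_blocks
        for p in self.blocks.parameters():
            p.requires_grad = False                     # backbone stays frozen
        self.prompts = nn.Parameter(0.02 * torch.randn(1, num_prompts, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, dim)
        b = tokens.size(0)
        x = torch.cat([self.prompts.expand(b, -1, -1), tokens], dim=1)
        return self.blocks(x)
```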
arXiv Detail & Related papers (2022-07-28T21:09:31Z)
- Masked Image Modeling with Denoising Contrast [30.31920660487222]
Masked image modeling dominates self-supervised pre-training research with state-of-the-art performance on vision Transformers.
We introduce a new pre-training method, ConMIM, to produce simple intra-image inter-patch contrastive constraints.
ConMIM-pretrained vision Transformers of various scales achieve promising results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks.
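A generic sketch of an intra-image inter-patch contrastive constraint of this kind: patch tokens from a corrupted (masked) view are matched to their own counterparts from a clean view of the same image via InfoNCE. The two-view setup and temperature are assumptions, not ConMIM's exact denoising-contrast formulation.

```python
import torch
import torch.nn.functional as F

def intra_image_patch_contrast(masked_view_patches: torch.Tensor,
                               clean_view_patches: torch.Tensor,
                               tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over patches of the same image: each patch token from the masked
    view must match its own clean-view counterpart against all other patches."""
    q = F.normalize(masked_view_patches, dim=-1)           # (N, D)
    k = F.normalize(clean_view_patches.detach(), dim=-1)   # (N, D), e.g. a momentum branch
    logits = q @ k.t() / tau                                # (N, N), positives on the diagonal
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```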
arXiv Detail & Related papers (2022-05-19T15:22:29Z)
- Stereopagnosia: Fooling Stereo Networks with Adversarial Perturbations [71.00754846434744]
We show that imperceptible additive perturbations can significantly alter the disparity map.
We show that, when used for adversarial data augmentation, our perturbations result in trained models that are more robust.
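A generic FGSM-style sketch of how such an imperceptible additive perturbation can be crafted against a stereo network; the attack actually used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def fgsm_stereo_perturbation(stereo_net, left, right, disparity_ref, eps=2.0 / 255):
    """One-step sign-gradient perturbation of both stereo views that pushes the
    predicted disparity away from a reference map (illustrative sketch only)."""
    left = left.clone().requires_grad_(True)
    right = right.clone().requires_grad_(True)
    loss = F.l1_loss(stereo_net(left, right), disparity_ref)
    loss.backward()
    return eps * left.grad.sign(), eps * right.grad.sign()
```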
arXiv Detail & Related papers (2020-09-21T19:20:09Z)
- Encoding Robustness to Image Style via Adversarial Feature Perturbations [72.81911076841408]
We adapt adversarial training by directly perturbing feature statistics, rather than image pixels, to produce robust models.
Our proposed method, Adversarial Batch Normalization (AdvBN), is a single network layer that generates worst-case feature perturbations during training.
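A rough sketch of perturbing feature statistics instead of pixels: the per-channel mean and standard deviation of an intermediate feature map are shifted by bounded offsets. The bounding scheme and how the offsets are chosen are assumptions; AdvBN's exact worst-case update is not reproduced here.

```python
import torch

def perturb_feature_statistics(feat: torch.Tensor,
                               delta_mean: torch.Tensor,
                               delta_std: torch.Tensor,
                               eps: float = 0.5) -> torch.Tensor:
    """Shift the per-channel mean/std of a (B, C, H, W) feature map by bounded
    offsets (delta_* of shape (1, C, 1, 1), chosen adversarially during training)."""
    mu = feat.mean(dim=(0, 2, 3), keepdim=True)
    sigma = feat.std(dim=(0, 2, 3), keepdim=True) + 1e-5
    normalized = (feat - mu) / sigma
    new_mu = mu + eps * torch.tanh(delta_mean)
    new_sigma = sigma * (1.0 + eps * torch.tanh(delta_std))
    return normalized * new_sigma + new_mu
```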
arXiv Detail & Related papers (2020-09-18T17:52:34Z)
- Distilling Visual Priors from Self-Supervised Learning [24.79633121345066]
Convolutional Neural Networks (CNNs) are prone to overfitting small training datasets.
We present a novel two-phase pipeline that leverages self-supervised learning and knowledge distillation to improve the generalization ability of CNN models for image classification under the data-deficient setting.
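A minimal sketch of the distillation phase of such a pipeline, where a student mimics the features of a frozen self-supervised teacher alongside the usual supervised loss; the feature-matching term and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def distill_visual_priors(student_feats, teacher_feats, task_logits, labels, alpha=0.5):
    """Combine cross-entropy on labels with a feature-mimicking term against a
    frozen self-supervised teacher (generic sketch of phase two)."""
    ce = F.cross_entropy(task_logits, labels)
    kd = F.mse_loss(F.normalize(student_feats, dim=-1),
                    F.normalize(teacher_feats.detach(), dim=-1))
    return (1.0 - alpha) * ce + alpha * kd
```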
arXiv Detail & Related papers (2020-08-01T13:07:18Z)
- Self-Supervised Linear Motion Deblurring [112.75317069916579]
Deep convolutional neural networks are state-of-the-art for image deblurring.
We present a differentiable reblur model for self-supervised motion deblurring.
Our experiments demonstrate that self-supervised single-image deblurring is indeed feasible.
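A rough sketch of the reblur idea behind this kind of self-supervision: warp the predicted sharp frame along a predicted linear motion, average the warped copies, and compare the result with the observed blurry input. The warping scheme and loss are assumptions; the paper's differentiable reblur model is more involved.

```python
import torch
import torch.nn.functional as F

def linear_reblur_loss(sharp_pred, flow, blurry_input, n_steps=5):
    """sharp_pred, blurry_input: (B, C, H, W); flow: (B, 2, H, W) linear motion in
    normalized [-1, 1] coordinates. Re-blur the prediction and match the input."""
    b, c, h, w = sharp_pred.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=sharp_pred.device),
                            torch.linspace(-1, 1, w, device=sharp_pred.device),
                            indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    reblurred = 0.0
    for t in torch.linspace(-0.5, 0.5, n_steps).tolist():
        grid = base + t * flow.permute(0, 2, 3, 1)          # sample along the motion line
        reblurred = reblurred + F.grid_sample(sharp_pred, grid, align_corners=True)
    return F.l1_loss(reblurred / n_steps, blurry_input)
```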
arXiv Detail & Related papers (2020-02-10T20:15:21Z)
- Text-to-Image Generation with Attention Based Recurrent Neural Networks [1.2599533416395765]
We develop a tractable and stable caption-based image generation model.
Experiments are performed on Microsoft datasets.
Results show that the proposed model performs better than contemporary approaches.
arXiv Detail & Related papers (2020-01-18T12:19:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.