General-purpose, long-context autoregressive modeling with Perceiver AR
- URL: http://arxiv.org/abs/2202.07765v1
- Date: Tue, 15 Feb 2022 22:31:42 GMT
- Title: General-purpose, long-context autoregressive modeling with Perceiver AR
- Authors: Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian
Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals,
Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste
Alayrac, João Carreira, Jesse Engel
- Abstract summary: We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to latents.
Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation.
Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books.
- Score: 58.976153199352254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world data is high-dimensional: a book, image, or musical performance
can easily contain hundreds of thousands of elements even after compression.
However, the most commonly used autoregressive models, Transformers, are
prohibitively expensive to scale to the number of inputs and layers needed to
capture this long-range structure. We develop Perceiver AR, an autoregressive,
modality-agnostic architecture which uses cross-attention to map long-range
inputs to a small number of latents while also maintaining end-to-end causal
masking. Perceiver AR can directly attend to over a hundred thousand tokens,
enabling practical long-context density estimation without the need for
hand-crafted sparsity patterns or memory mechanisms. When trained on images or
music, Perceiver AR generates outputs with clear long-term coherence and
structure. Our architecture also obtains state-of-the-art likelihood on
long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books.
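The core operation the abstract describes, cross-attending a long input down to a small number of latents while preserving causal masking, can be sketched in a few lines. In this toy NumPy illustration (random weights, not the paper's trained model), the queries come from the final positions of the sequence (one per latent) and the keys and values come from the full input, with a mask so each latent only attends to positions at or before its own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_cross_attention(inputs, num_latents, d):
    """Map a long input sequence to a few latents via causal cross-attention.

    The latents are queries taken from the last `num_latents` positions of
    the sequence; each latent may only attend to input positions at or
    before its own absolute position. Random weights: a shape/masking
    sketch, not Perceiver AR's actual implementation.
    """
    n = inputs.shape[0]
    rng = np.random.default_rng(0)
    Wq = rng.normal(size=(inputs.shape[1], d))
    Wk = rng.normal(size=(inputs.shape[1], d))
    Wv = rng.normal(size=(inputs.shape[1], d))

    queries = inputs[-num_latents:] @ Wq        # (num_latents, d)
    keys, values = inputs @ Wk, inputs @ Wv     # (n, d) each

    scores = queries @ keys.T / np.sqrt(d)      # (num_latents, n)
    # latent i sits at absolute position n - num_latents + i
    latent_pos = np.arange(n - num_latents, n)[:, None]
    scores = np.where(np.arange(n)[None, :] <= latent_pos, scores, -np.inf)
    return softmax(scores) @ values             # (num_latents, d)

latents = causal_cross_attention(
    np.random.default_rng(1).normal(size=(1024, 16)), num_latents=8, d=32)
print(latents.shape)  # (8, 32)
```

Because the expensive attention is only between 8 latents and 1024 inputs (rather than 1024 × 1024), cost grows linearly in the input length, which is what lets the architecture attend to over a hundred thousand tokens.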
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest-neighbor feature propagation strategy.
arXiv Detail & Related papers (2026-03-02T11:35:05Z) - Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction [19.234118544637592]
Long-LRM++ is a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU. Our design also scales to 64 input views at $950\times540$ resolution, demonstrating strong generalization to increased input lengths.
arXiv Detail & Related papers (2025-12-11T04:10:21Z) - RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z) - Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation [87.00172597953228]
Speculative decoding has shown promise in accelerating text generation without compromising quality. We introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models.
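The speculative decoding mentioned above follows a draft-and-verify pattern. As a generic sketch (greedy variant, with hypothetical `draft_next`/`target_next` callables rather than Hawk's actual interface): a cheap draft model proposes a run of tokens, and the expensive target model verifies them, keeping the longest matching prefix and supplying one corrected token on the first disagreement.

```python
def speculative_step(draft_next, target_next, prefix, k):
    """One round of greedy speculative decoding (sketch, not Hawk's API).

    draft_next / target_next: callables mapping a context (list of tokens)
    to the next token. The draft proposes k tokens; the target accepts the
    longest matching run and emits one corrected token when they disagree.
    """
    # Draft phase: propose k tokens autoregressively with the cheap model.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # Verify phase: the target checks each proposed token in order.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        v = target_next(ctx)
        if v == t:
            accepted.append(t)      # target agrees: keep the draft token
            ctx.append(t)
        else:
            accepted.append(v)      # target overrides; stop verifying
            break
    else:
        accepted.append(target_next(ctx))  # all accepted: free bonus token
    return accepted

# Toy run: draft always proposes 1; target agrees until the context has 4 tokens.
draft = lambda ctx: 1
target = lambda ctx: 1 if len(ctx) < 4 else 2
print(speculative_step(draft, target, prefix=[0, 0], k=3))  # [1, 1, 2]
```

Each round costs one draft pass per proposed token plus target verification, but the target's calls can be batched in a real model, which is where the speedup comes from.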
arXiv Detail & Related papers (2025-10-29T17:43:31Z) - ArchGPT: Understanding the World's Architectures with Large Multimodal Models [6.504675786709239]
We present ArchGPT, a multimodal architectural visual question answering (VQA) model. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets.
arXiv Detail & Related papers (2025-09-25T07:49:43Z) - NFIG: Autoregressive Image Generation with Next-Frequency Prediction [50.69346038028673]
We present Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images.
arXiv Detail & Related papers (2025-03-10T08:59:10Z) - M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation [39.97174784206976]
We show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale and inter-scale modeling.
We apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead.
Experiments demonstrate that our method outperforms existing models in both image quality and generation speed.
arXiv Detail & Related papers (2024-11-15T18:54:42Z) - Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling [15.013242103936625]
We introduce a new paradigm for AutoRegressive (AR) image generation, termed Set AutoRegressive Modeling (SAR).
SAR generalizes the conventional AR to the next-set setting, i.e., splitting the sequence into arbitrary sets containing multiple tokens.
We explore the properties of SAR by analyzing the impact of sequence order and output intervals on performance.
arXiv Detail & Related papers (2024-10-14T13:49:06Z) - LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture [42.12601112927263]
LongLLaVA achieves competitive results across various benchmarks while maintaining high throughput and low memory consumption. Notably, it can process nearly one thousand images on a single A100 80GB GPU, underscoring its potential for a wide range of multi-modal applications.
arXiv Detail & Related papers (2024-09-04T17:25:21Z) - Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models [22.702352459581434]
Serpent is an efficient architecture for high-resolution image restoration.
We show that Serpent can achieve reconstruction quality on par with state-of-the-art techniques.
arXiv Detail & Related papers (2024-03-26T17:43:15Z) - LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
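The $O(L \log L)$ complexity comes from the standard FFT trick for long convolutions, which is how state-space layers are typically materialized at training time. A minimal 1-D NumPy illustration of that trick (a generic sketch, not LOCOST's actual code):

```python
import numpy as np

def fft_long_conv(u, k):
    """Causal convolution of signal u with kernel k in O(L log L) via FFT.

    State-space layers unroll their recurrence into one long convolution
    kernel; computing that convolution in the frequency domain costs
    O(L log L) instead of the O(L^2) of a naive sliding dot product.
    """
    L = len(u)
    n = 2 * L  # zero-pad so circular FFT convolution equals linear convolution
    y = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(k, n), n)[:L]
    return y

u = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 1.0, 0.0, 0.0])
print(fft_long_conv(u, k))  # ≈ [1. 3. 5. 7.]
```

The result matches a direct convolution truncated to the input length, but the cost scales near-linearly with sequence length, which is the property the abstract is contrasting with sparse attention.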
arXiv Detail & Related papers (2024-01-31T15:33:37Z) - Efficient Image Captioning for Edge Devices [8.724184244203892]
We propose LightCap, a lightweight image captioner for resource-limited devices.
The core design is built on the recent CLIP model for efficient image captioning.
With the carefully designed architecture, our model merely contains 40M parameters, saving the model size by more than 75% and the FLOPs by more than 98%.
arXiv Detail & Related papers (2022-12-18T01:56:33Z) - LaMAR: Benchmarking Localization and Mapping for Augmented Reality [80.23361950062302]
We introduce LaMAR, a new benchmark with a comprehensive capture and GT pipeline that co-registers realistic trajectories and sensor streams captured by heterogeneous AR devices.
We publish a benchmark dataset of diverse and large-scale scenes recorded with head-mounted and hand-held AR devices.
arXiv Detail & Related papers (2022-10-19T17:58:17Z) - Lightweight Long-Range Generative Adversarial Networks [58.16484259508973]
We introduce a novel lightweight generative adversarial network, which can effectively capture long-range dependencies in the image generation process.
The proposed long-range module can highlight negative relations between pixels, working as a regularization to stabilize training.
Our novel long-range module only introduces few additional parameters and is easily inserted into existing models to capture long-range dependencies.
arXiv Detail & Related papers (2022-09-08T13:05:01Z) - Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z) - Multi-Stage Progressive Image Restoration [167.6852235432918]
We propose a novel synergistic design that can optimally balance these competing goals.
Our main proposal is a multi-stage architecture, that progressively learns restoration functions for the degraded inputs.
The resulting tightly interlinked multi-stage architecture, named as MPRNet, delivers strong performance gains on ten datasets.
arXiv Detail & Related papers (2021-02-04T18:57:07Z) - Residual Attention Net for Superior Cross-Domain Time Sequence Modeling [0.0]
This paper serves as a proof-of-concept for a new architecture, with RAN aiming at providing the model a higher level understanding of sequence patterns.
We have achieved 35 state-of-the-art results, with 10 results matching the current state of the art, without further model fine-tuning.
arXiv Detail & Related papers (2020-01-13T06:14:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.