Perceiver IO: A General Architecture for Structured Inputs & Outputs
- URL: http://arxiv.org/abs/2107.14795v2
- Date: Mon, 2 Aug 2021 17:18:43 GMT
- Title: Perceiver IO: A General Architecture for Structured Inputs & Outputs
- Authors: Andrew Jaegle and Sebastian Borgeaud and Jean-Baptiste Alayrac and
Carl Doersch and Catalin Ionescu and David Ding and Skanda Koppula and Daniel
Zoran and Andrew Brock and Evan Shelhamer and Olivier Hénaff and Matthew M.
Botvinick and Andrew Zisserman and Oriol Vinyals and João Carreira
- Abstract summary: Perceiver IO learns to flexibly query the model's latent space to produce outputs of arbitrary size and semantics.
The model achieves strong results on tasks with highly structured output spaces.
Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark.
- Score: 84.60656759687477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently-proposed Perceiver model obtains good results on several domains
(images, audio, multimodal, point clouds) while scaling linearly in compute and
memory with the input size. While the Perceiver supports many kinds of inputs,
it can only produce very simple outputs such as class scores. Perceiver IO
overcomes this limitation without sacrificing the original's appealing
properties by learning to flexibly query the model's latent space to produce
outputs of arbitrary size and semantics. Perceiver IO still decouples model
depth from data size and still scales linearly with data size, but now with
respect to both input and output sizes. The full Perceiver IO model achieves
strong results on tasks with highly structured output spaces, such as natural
language and visual understanding, StarCraft II, and multi-task and multi-modal
domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline
on the GLUE language benchmark without the need for input tokenization and
achieves state-of-the-art performance on Sintel optical flow estimation.
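For intuition, the encode-process-decode pattern described in the abstract can be sketched in a few lines of NumPy. The toy below is a minimal single-head version with random features and arbitrary sizes; the actual model adds multi-head attention, layer norms, MLP blocks, and learned positional/query features.

```python
# Minimal sketch of the Perceiver IO encode-process-decode pattern (assumed
# single-head attention, random features; sizes are illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention: queries attend over keys/values."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

d = 64                     # shared channel width
M, N, O = 5000, 256, 37    # input size, latent size, output size (arbitrary)

inputs  = rng.normal(size=(M, d))   # e.g. flattened image/audio/text features
latents = rng.normal(size=(N, d))   # learned latent array (fixed size N << M)
queries = rng.normal(size=(O, d))   # one query per desired output element

# Encode: cross-attention maps the large input array into the small latent array.
# Cost is O(M * N), i.e. linear in the input size.
z = attend(latents, inputs, inputs)

# Process: a stack of latent self-attention blocks; depth is independent of M and O.
for _ in range(8):
    z = z + attend(z, z, z)

# Decode: output queries cross-attend the latents, one output vector per query.
# Cost is O(O * N), i.e. linear in the output size.
outputs = attend(queries, z, z)
print(outputs.shape)   # (37, 64): output size and semantics set by the queries
```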
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
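As a hedged illustration of "compressing the number of visual tokens at different granularity": the sketch below builds nested token sets by repeated 2x2 average pooling. The pooling rule is only an assumed stand-in for however MME actually compresses tokens.

```python
# Assumed Matryoshka-style compression of visual tokens via average pooling;
# the paper's actual mechanism may differ.
import numpy as np

def pool_tokens(tokens, grid):
    """Average-pool a (grid*grid, d) token array down to ((grid//2)**2, d)."""
    d = tokens.shape[-1]
    t = tokens.reshape(grid, grid, d)
    t = t.reshape(grid // 2, 2, grid // 2, 2, d).mean(axis=(1, 3))
    return t.reshape(-1, d)

rng = np.random.default_rng(0)
grid, d = 16, 32
tokens = rng.normal(size=(grid * grid, d))   # 256 visual tokens from an encoder

granularities = {grid * grid: tokens}
g, t = grid, tokens
while g > 1:
    t = pool_tokens(t, g)
    g //= 2
    granularities[g * g] = t

# Nested views of the same image at 256, 64, 16, 4, and 1 tokens.
print(sorted(granularities.keys(), reverse=True))
```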
- Efficient Response Generation Method Selection for Fine-Tuning Large Language Models [28.717420152590204]
Recent studies have observed that the choice of output variation used in training can affect the model's performance.
This paper proposes a scalable, approximate method for estimating the quality of a small subset of generated training data.
We show that training an LLM on data generated by the selected strategy can lead to a significant performance gain.
arXiv Detail & Related papers (2025-02-17T13:14:11Z)
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling [10.985444895887207]
We introduce Over-Tokenized Transformers, a framework that decouples input and output vocabularies to improve language modeling performance.
We uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance.
Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design.
arXiv Detail & Related papers (2025-01-28T14:15:42Z)
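A hedged sketch of decoupled input and output vocabularies: the input embedding table is enlarged (here with assumed bigram ids, in the spirit of larger input vocabularies) while the output softmax stays over the base vocabulary. Sizes and the bigram scheme are illustrative, not the paper's configuration.

```python
# Decoupled input/output vocabularies (illustrative assumption: base tokens
# plus bigram ids on the input side only).
import numpy as np

rng = np.random.default_rng(0)
d = 32
base_vocab = 100                     # output vocabulary (next-token prediction)
input_vocab = base_vocab + base_vocab ** 2   # base tokens plus all bigram ids

W_in  = rng.normal(size=(input_vocab, d)) * 0.02   # large input embedding table
W_out = rng.normal(size=(base_vocab, d)) * 0.02    # small output (unembedding) table

def embed(token_ids):
    """Embed each position as its base token plus the preceding-bigram id."""
    vecs = W_in[token_ids]
    bigram_ids = base_vocab + token_ids[:-1] * base_vocab + token_ids[1:]
    vecs[1:] += W_in[bigram_ids]
    return vecs

tokens = rng.integers(0, base_vocab, size=16)
h = embed(tokens)                 # (16, d); the transformer body would go here
logits = h @ W_out.T              # predictions over the small output vocabulary
print(h.shape, logits.shape)      # (16, 32) (16, 100)
```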
- Deriving Coding-Specific Sub-Models from LLMs using Resource-Efficient Pruning [4.762390044282733]
Large Language Models (LLMs) have demonstrated exceptional performance on complex code generation tasks, but at the cost of substantial computational resources.
To mitigate such requirements, model pruning techniques are used to create more compact models with significantly fewer parameters.
In this work, we explore the idea of efficiently deriving coding-specific sub-models through unstructured pruning.
arXiv Detail & Related papers (2025-01-09T14:00:01Z)
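Unstructured pruning itself is easy to sketch: the smallest-magnitude weights are zeroed out globally until a target sparsity is reached. The threshold rule and the 90% sparsity below are illustrative choices, not the paper's recipe.

```python
# Minimal global magnitude pruning (unstructured): zero the smallest weights.
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude entries."""
    flat = np.abs(np.concatenate([w.ravel() for w in weights]))
    threshold = np.quantile(flat, sparsity)
    return [np.where(np.abs(w) < threshold, 0.0, w) for w in weights]

rng = np.random.default_rng(0)
layers = [rng.normal(size=(256, 256)), rng.normal(size=(256, 64))]
pruned = magnitude_prune(layers, sparsity=0.9)

kept = sum((w != 0).sum() for w in pruned) / sum(w.size for w in layers)
print(f"remaining parameters: {kept:.1%}")   # roughly 10%
```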
- Large Concept Models: Language Modeling in a Sentence Representation Space [62.73366944266477]
We present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept.
Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow.
We show that our model exhibits impressive zero-shot generalization performance to many languages.
arXiv Detail & Related papers (2024-12-11T23:36:20Z)
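As a rough, hedged illustration of "language modeling in a sentence representation space": the toy below treats each sentence embedding as a "concept" and fits a linear next-concept predictor under an MSE objective. The random stand-in encoder, the linear predictor, and the loss are assumptions for illustration; the paper builds on pretrained sentence encoders and much larger sequence models.

```python
# Assumed toy version of next-concept prediction in an embedding space.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # concept (sentence embedding) dimension
concepts = rng.normal(size=(200, d))     # stand-in for encoded sentences of a document

# Fit a linear next-concept predictor: minimize ||c_{t+1} - c_t W||^2.
X, Y = concepts[:-1], concepts[1:]
W = np.linalg.lstsq(X, Y, rcond=None)[0]

pred = X @ W
mse = np.mean((pred - Y) ** 2)
print(f"next-concept MSE: {mse:.3f}")
# Generation would decode predicted concept vectors back to text (or speech)
# with a frozen decoder, keeping the model language- and modality-agnostic
# at the concept level.
```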
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer [66.71930982549028]
Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
arXiv Detail & Related papers (2024-03-05T14:13:50Z)
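A hedged reading of "alignment-guided token pruning": vision tokens that align weakly with the text tokens are dropped before the expensive transformer layers. Scoring by maximum cosine similarity and keeping a fixed top-k are assumptions for illustration; MADTP learns its pruning decisions dynamically.

```python
# Assumed alignment-guided pruning of vision tokens using text tokens.
import numpy as np

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
vision_tokens = rng.normal(size=(196, 64))   # e.g. 14x14 patch features
text_tokens   = rng.normal(size=(12, 64))    # caption / question features

alignment = cosine(vision_tokens, text_tokens).max(axis=1)  # best text match per patch
keep = np.argsort(alignment)[-64:]                          # keep the 64 best-aligned patches
pruned_vision = vision_tokens[keep]
print(pruned_vision.shape)   # (64, 64): fewer tokens fed to later layers
```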
- DCT-Former: Efficient Self-Attention with Discrete Cosine Transform [4.622165486890318]
An intrinsic limitation of Transformer architectures arises from the computation of the dot-product attention.
Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module.
An extensive section of experiments shows that our method takes up less memory for the same performance, while also drastically reducing inference time.
arXiv Detail & Related papers (2022-03-02T15:25:27Z)
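The JPEG-style idea can be sketched as follows: compress the sequence axis of keys and values with a truncated Discrete Cosine Transform, keep only the low-frequency components, and run ordinary dot-product attention on the much shorter compressed sequence. The exact placement of the transform and the number of kept components are assumptions for illustration.

```python
# Assumed DCT-based compression of keys/values before standard attention.
import numpy as np

def dct_matrix(m, n):
    """First m rows of the orthonormal DCT-II matrix for length-n signals."""
    k = np.arange(m)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C                      # (m, n)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, m, d = 1024, 64, 32            # sequence length, kept DCT components, head dim
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

C = dct_matrix(m, n)              # compression operator along the sequence axis
K_c, V_c = C @ K, C @ V           # (m, d): 16x fewer key/value positions

# Attention cost drops from O(n^2 * d) to O(n * m * d).
out = softmax(Q @ K_c.T / np.sqrt(d)) @ V_c
print(out.shape)                  # (1024, 32)
```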
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers by directly translating the image feature map into the object detection result.
The recent transformer-based image recognition model ViT also shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- Direction is what you need: Improving Word Embedding Compression in Large Language Models [7.736463504706344]
This paper presents a novel loss objective to compress token embeddings in Transformer-based models by leveraging an AutoEncoder architecture.
Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model Perplexity.
arXiv Detail & Related papers (2021-06-15T14:28:00Z)
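The SVD baseline mentioned above is simple to sketch: factor the token embedding matrix into two low-rank factors, so the model stores far fewer parameters and reconstructs embeddings on the fly. The paper instead trains an autoencoder under its proposed loss; the rank and matrix sizes below are illustrative.

```python
# SVD-based low-rank compression of a token embedding matrix (the baseline
# referenced in the summary; sizes are illustrative).
import numpy as np

rng = np.random.default_rng(0)
vocab, d, rank = 5000, 256, 32
E = rng.normal(size=(vocab, d))               # full token embedding matrix

U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :rank] * S[:rank]                    # (vocab, rank)
B = Vt[:rank]                                 # (rank, d)

compression = (A.size + B.size) / E.size
err = np.linalg.norm(E - A @ B) / np.linalg.norm(E)
print(f"parameters kept: {compression:.1%}, relative reconstruction error: {err:.3f}")
```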
- Perceiver: General Perception with Iterative Attention [85.65927856589613]
We introduce the Perceiver - a model that builds upon Transformers.
We show that this architecture performs competitively or beyond strong, specialized models on classification tasks.
It also surpasses state-of-the-art results for all modalities in AudioSet.
arXiv Detail & Related papers (2021-03-04T18:20:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.