ODE Transformer: An Ordinary Differential Equation-Inspired Model for
Sequence Generation
- URL: http://arxiv.org/abs/2203.09176v1
- Date: Thu, 17 Mar 2022 08:54:31 GMT
- Title: ODE Transformer: An Ordinary Differential Equation-Inspired Model for
Sequence Generation
- Authors: Bei Li, Quan Du, Tao Zhou, Yi Jing, Shuhan Zhou, Xin Zeng, Tong Xiao,
JingBo Zhu, Xuebo Liu, Min Zhang
- Abstract summary: This paper explores a deeper relationship between Transformer and numerical ODE methods.
We first show that a residual block of layers in Transformer can be described as a higher-order solution to an ODE.
Inspired by this, we design a new architecture, ODE Transformer, which is easy to implement and efficient to use.
- Score: 44.101125095045326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Residual networks are an Euler discretization of solutions to Ordinary
Differential Equations (ODE). This paper explores a deeper relationship between
Transformer and numerical ODE methods. We first show that a residual block of
layers in Transformer can be described as a higher-order solution to ODE.
Inspired by this, we design a new architecture, ODE Transformer, which is
analogous to the Runge-Kutta method that is well motivated in ODE. As a natural
extension to Transformer, ODE Transformer is easy to implement and efficient to
use. Experimental results on the large-scale machine translation, abstractive
summarization, and grammar error correction tasks demonstrate the high
genericity of ODE Transformer. It can gain large improvements in model
performance over strong baselines (e.g., 30.77 and 44.11 BLEU scores on the
WMT'14 English-German and English-French benchmarks) at a slight cost in
inference efficiency.
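To make the ODE analogy concrete: stacking standard residual blocks y_{t+1} = y_t + F(y_t) is a first-order (Euler) discretization of dy/dt = F(y), and a higher-order block reuses the layer function F several times per step, in the spirit of Runge-Kutta integration. The sketch below is a minimal illustration of these update rules; F is a toy stand-in for a Transformer sub-layer (self-attention plus FFN), and all names and dimensions are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Toy stand-in for a Transformer sub-layer F (self-attention + FFN).
# A real implementation would use the actual layer; this keeps the
# sketch self-contained and runnable.
rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, d))

def F(y):
    return np.tanh(y @ W)

def euler_block(y):
    # Standard residual block: first-order Euler step of dy/dt = F(y).
    return y + F(y)

def rk2_block(y):
    # Second-order Runge-Kutta (Heun) step: evaluate F twice and
    # average the two slope estimates.
    f1 = F(y)
    f2 = F(y + f1)
    return y + 0.5 * (f1 + f2)

def rk4_block(y):
    # Classical fourth-order Runge-Kutta step with unit step size.
    k1 = F(y)
    k2 = F(y + 0.5 * k1)
    k3 = F(y + 0.5 * k2)
    k4 = F(y + k3)
    return y + (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

y0 = rng.normal(size=(4, d))  # four token states of width d
print("Euler:", euler_block(y0)[0, :3])
print("RK2:  ", rk2_block(y0)[0, :3])
print("RK4:  ", rk4_block(y0)[0, :3])
```

Each higher-order block evaluates F several times per layer, which is consistent with the abstract's note that the accuracy gains come at a slight cost in inference efficiency.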
Related papers
- DDOT: A Derivative-directed Dual-decoder Ordinary Differential Equation Transformer for Dynamic System Modeling [16.33495160112142]
We introduce DDOT, a transformer-based model designed to reconstruct multidimensional ODEs in symbolic form. By incorporating an auxiliary task that predicts the ODE's derivative, DDOT effectively captures both structure and dynamic behavior. DDOT outperforms existing symbolic regression methods, achieving absolute improvements of 4.58% and 1.62% in $P(R^2 > 0.9)$ for reconstruction and generalization tasks.
arXiv Detail & Related papers (2025-06-23T11:24:52Z)
- Graph Transformers Dream of Electric Flow [72.06286909236827]
We show that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems.
We present explicit weight configurations for implementing each such graph algorithm, and we bound the errors of the constructed Transformers by the errors of the underlying algorithms.
arXiv Detail & Related papers (2024-10-22T05:11:45Z)
- On Exact Bit-level Reversible Transformers Without Changing Architectures [4.282029766809805]
Reversible deep neural networks (DNNs) have been proposed to reduce memory consumption during training.
We present the BDIA-transformer, which is an exact bit-level reversible transformer that uses an unchanged standard architecture for inference.
arXiv Detail & Related papers (2024-07-12T08:42:58Z)
- Do Efficient Transformers Really Save Computation? [32.919672616480135]
We focus on the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer.
Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size.
We identify a class of DP problems for which these models can be more efficient than the standard Transformer.
arXiv Detail & Related papers (2024-02-21T17:00:56Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- Discovering ordinary differential equations that govern time-series [65.07437364102931]
We propose a transformer-based sequence-to-sequence model that recovers scalar autonomous ordinary differential equations (ODEs) in symbolic form from time-series data of a single observed solution of the ODE.
Our method is efficiently scalable: after one-time pretraining on a large set of ODEs, we can infer the governing laws of a new observed solution in a few forward passes of the model.
arXiv Detail & Related papers (2022-11-05T07:07:58Z)
- Fastformer: Additive Attention Can Be All You Need [51.79399904527525]
We propose Fastformer, which is an efficient Transformer model based on additive attention.
In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts.
In this way, Fastformer can achieve effective context modeling with linear complexity.
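As a rough, single-head illustration of the additive-attention idea summarized above: queries are condensed into one global query vector via attention weights, that vector is mixed element-wise into the keys, and the same trick yields a global key that modulates the values, giving a cost linear in sequence length. The projection vectors w_q and w_k, the query residual, and all other names below are assumptions made to keep the sketch self-contained; the actual Fastformer adds multi-head structure and learned output transformations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(Q, K, V, w_q, w_k):
    # Q, K, V: (n, d); w_q, w_k: (d,). Everything below is O(n * d).
    d = Q.shape[1]
    alpha = softmax(Q @ w_q / np.sqrt(d))   # (n,) attention over query positions
    q_global = alpha @ Q                    # (d,) global query vector
    P = q_global * K                        # (n, d) query-aware keys
    beta = softmax(P @ w_k / np.sqrt(d))    # (n,)
    k_global = beta @ P                     # (d,) global key vector
    # Query residual; the real Fastformer applies a linear output
    # projection before adding it back.
    return k_global * V + Q                 # (n, d)

rng = np.random.default_rng(1)
n, d = 6, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
w_q, w_k = rng.normal(size=d), rng.normal(size=d)
print(additive_attention(Q, K, V, w_q, w_k).shape)  # (6, 8)
```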
arXiv Detail & Related papers (2021-08-20T09:44:44Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
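The computational core of that observation can be shown in isolation: multiplying an n x n Toeplitz matrix (standing in here for the relative-position bias pattern) by a vector costs O(n log n) once the matrix is embedded in a circulant and the product is done with the FFT. The snippet below demonstrates only this primitive under assumed names; it is not the paper's full kernelized-attention procedure.

```python
import numpy as np

def toeplitz_matvec_fft(t_col, t_row, x):
    """Compute T @ x in O(n log n), where T is the n x n Toeplitz matrix
    with first column t_col and first row t_row (t_col[0] == t_row[0])."""
    n = len(x)
    # Embed T into a (2n-1) x (2n-1) circulant matrix and apply the
    # FFT-based circular convolution theorem.
    c = np.concatenate([t_col, t_row[:0:-1]])   # first column of the circulant
    x_pad = np.concatenate([x, np.zeros(n - 1)])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x_pad)).real
    return y[:n]

# Check against the dense O(n^2) product.
rng = np.random.default_rng(2)
n = 7
bias = rng.normal(size=2 * n - 1)               # relative-position biases b[-(n-1)..n-1]
T = np.array([[bias[n - 1 + i - j] for j in range(n)] for i in range(n)])
x = rng.normal(size=n)
print(np.allclose(T @ x, toeplitz_matvec_fft(T[:, 0], T[0, :], x)))  # True
```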
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Decision Transformer: Reinforcement Learning via Sequence Modeling [102.86873656751489]
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem.
We present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling.
Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
arXiv Detail & Related papers (2021-06-02T17:53:39Z)
- THG: Transformer with Hyperbolic Geometry [8.895324519034057]
Existing "X-former" models make changes only around the quadratic time and memory complexity of self-attention.
We propose a novel Transformer with Hyperbolic Geometry (THG) model, which takes advantage of both Euclidean space and hyperbolic space.
arXiv Detail & Related papers (2021-06-01T14:09:33Z)
- ODE Transformer: An Ordinary Differential Equation-Inspired Model for Neural Machine Translation [25.86053637998726]
We show that a residual block of layers in Transformer can be described as a higher-order solution to ODEs.
As a natural extension to Transformer, ODE Transformer is easy to implement and parameter efficient.
arXiv Detail & Related papers (2021-04-06T06:13:02Z)
- N-ODE Transformer: A Depth-Adaptive Variant of the Transformer Using Neural Ordinary Differential Equations [1.2183405753834562]
We use neural ordinary differential equations to formulate a variant of the Transformer that is depth-adaptive in the sense that an input-dependent number of time steps is taken by the ordinary differential equation solver.
We consider the simple problem of determining the parity of a binary sequence, for which the standard Transformer has known limitations.
We find, however, that the depth-adaptivity of the N-ODE Transformer does not provide a remedy for the inherently nonlocal nature of the parity problem.
arXiv Detail & Related papers (2020-10-22T00:48:24Z)