ODE Transformer: An Ordinary Differential Equation-Inspired Model for
Neural Machine Translation
- URL: http://arxiv.org/abs/2104.02308v1
- Date: Tue, 6 Apr 2021 06:13:02 GMT
- Title: ODE Transformer: An Ordinary Differential Equation-Inspired Model for
Neural Machine Translation
- Authors: Bei Li, Quan Du, Tao Zhou, Shuhan Zhou, Xin Zeng, Tong Xiao, Jingbo
Zhu
- Abstract summary: We show that a residual block of layers in Transformer can be described as a higher-order solution to ODEs.
As a natural extension to Transformer, ODE Transformer is easy to implement and parameter efficient.
- Score: 25.86053637998726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It has been found that residual networks are an Euler discretization of
solutions to Ordinary Differential Equations (ODEs). In this paper, we explore
a deeper relationship between Transformer and numerical methods of ODEs. We
show that a residual block of layers in Transformer can be described as a
higher-order solution to ODEs. This leads us to design a new architecture, ODE
Transformer, analogous to the Runge-Kutta method and well motivated by the
theory of ODEs. As a natural extension of the Transformer, ODE Transformer is
easy to implement and parameter-efficient. Our experiments on three WMT tasks
demonstrate the generality of this model and show large performance improvements
over several strong baselines. It achieves 30.76 and 44.11 BLEU scores on the
WMT'14 En-De and En-Fr test data. This sets a new state-of-the-art on the
WMT'14 En-Fr task.
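To make the abstract's central idea concrete, here is a minimal sketch (assuming PyTorch; not the authors' released code) contrasting the standard residual update, which reads as a first-order Euler step y = x + F(x), with a second-order Runge-Kutta-style block that reuses the same sublayer F for two evaluations per step. The stand-in sublayer, weighting, and dimensions are illustrative assumptions; in the actual model F would be an attention or feed-forward sublayer and the combination coefficients follow the paper.

```python
# Minimal sketch of Euler-style vs. Runge-Kutta-style residual blocks.
# The sublayer F below (LayerNorm + feed-forward) is a hypothetical stand-in.
import torch
import torch.nn as nn


class EulerBlock(nn.Module):
    """Standard residual block: a first-order Euler step y = x + F(x)."""
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f

    def forward(self, x):
        return x + self.f(x)


class RK2Block(nn.Module):
    """Second-order (Heun-style) update that reuses one sublayer F:
    F1 = F(x); F2 = F(x + F1); y = x + 0.5 * (F1 + F2)."""
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f  # the same parameters serve both evaluations

    def forward(self, x):
        f1 = self.f(x)
        f2 = self.f(x + f1)
        return x + 0.5 * (f1 + f2)


def make_sublayer(d_model: int = 512, d_ff: int = 2048) -> nn.Module:
    # Illustrative sublayer; the real model uses attention / feed-forward sublayers.
    return nn.Sequential(nn.LayerNorm(d_model),
                         nn.Linear(d_model, d_ff), nn.ReLU(),
                         nn.Linear(d_ff, d_model))


if __name__ == "__main__":
    x = torch.randn(8, 16, 512)            # (batch, length, d_model)
    print(EulerBlock(make_sublayer())(x).shape)
    print(RK2Block(make_sublayer())(x).shape)
```

The point of the higher-order block is that it spends extra sublayer evaluations per block rather than extra parameters, which is why the abstract describes the model as parameter-efficient.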
Related papers
- DDOT: A Derivative-directed Dual-decoder Ordinary Differential Equation Transformer for Dynamic System Modeling [16.33495160112142]
We introduce DDOT, a transformer-based model designed to reconstruct multidimensional ODEs in symbolic form. By incorporating an auxiliary task that predicts the ODE's derivative, DDOT effectively captures both structure and dynamic behavior. DDOT outperforms existing symbolic regression methods, achieving absolute improvements of 4.58% and 1.62% in $P(R^2 > 0.9)$ on reconstruction and generalization tasks, respectively.
arXiv Detail & Related papers (2025-06-23T11:24:52Z) - On the Trajectory Regularity of ODE-based Diffusion Sampling [79.17334230868693]
Diffusion-based generative models use differential equations to establish a smooth connection between a complex data distribution and a tractable prior distribution.
In this paper, we identify several intriguing trajectory properties in the ODE-based sampling process of diffusion models.
arXiv Detail & Related papers (2024-05-18T15:59:41Z) - Do Efficient Transformers Really Save Computation? [32.919672616480135]
We focus on the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer.
Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size.
We identify a class of DP problems for which these models can be more efficient than the standard Transformer.
arXiv Detail & Related papers (2024-02-21T17:00:56Z) - HAMLET: Graph Transformer Neural Operator for Partial Differential Equations [13.970458554623939]
We present a novel graph transformer framework, HAMLET, designed to address the challenges in solving partial differential equations (PDEs) using neural networks.
The framework uses graph transformers with modular input encoders to directly incorporate differential equation information into the solution process.
Notably, HAMLET scales effectively with increasing data complexity and noise, showcasing its robustness.
arXiv Detail & Related papers (2024-02-05T21:55:24Z) - Predicting Ordinary Differential Equations with Transformers [65.07437364102931]
We develop a transformer-based sequence-to-sequence model that recovers scalar ordinary differential equations (ODEs) in symbolic form from irregularly sampled and noisy observations of a single solution trajectory.
Our method is efficiently scalable: after one-time pretraining on a large set of ODEs, we can infer the governing law of a new observed solution in a few forward passes of the model.
arXiv Detail & Related papers (2023-07-24T08:46:12Z) - Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z) - A Neural ODE Interpretation of Transformer Layers [8.839601328192957]
Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems.
We build upon the connection between transformer layers and ordinary differential equations and propose a modification of the internal architecture of a transformer layer.
Our experiments show that this simple modification improves the performance of transformer networks in multiple tasks.
arXiv Detail & Related papers (2022-12-12T16:18:58Z) - Discovering ordinary differential equations that govern time-series [65.07437364102931]
We propose a transformer-based sequence-to-sequence model that recovers scalar autonomous ordinary differential equations (ODEs) in symbolic form from time-series data of a single observed solution of the ODE.
Our method is efficiently scalable: after one-time pretraining on a large set of ODEs, we can infer the governing laws of a new observed solution in a few forward passes of the model.
arXiv Detail & Related papers (2022-11-05T07:07:58Z) - ODE Transformer: An Ordinary Differential Equation-Inspired Model for
Sequence Generation [44.101125095045326]
This paper explores a deeper relationship between Transformer and numerical ODE methods.
We first show that a residual block of layers in Transformer can be described as a higher-order solution to an ODE.
Inspired by this, we design a new architecture, ODE Transformer, which is easy to implement and efficient to use.
arXiv Detail & Related papers (2022-03-17T08:54:31Z) - IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z) - N-ODE Transformer: A Depth-Adaptive Variant of the Transformer Using
Neural Ordinary Differential Equations [1.2183405753834562]
We use neural ordinary differential equations to formulate a variant of the Transformer that is depth-adaptive in the sense that an input-dependent number of time steps is taken by the ordinary differential equation solver.
We consider the simple problem of determining the parity of a binary sequence, for which the standard Transformer has known limitations.
We find, however, that the depth-adaptivity of the N-ODE Transformer does not provide a remedy for the inherently nonlocal nature of the parity problem.
arXiv Detail & Related papers (2020-10-22T00:48:24Z)
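The depth-adaptive mechanism described in the N-ODE Transformer entry directly above can be illustrated with a short sketch. Assuming PyTorch and the third-party torchdiffeq package (its odeint solver), the code below treats a single Transformer encoder layer as the vector field of an ODE and lets an adaptive solver decide how many function evaluations, i.e. how much effective depth, each input receives. This is an illustration of the general mechanism under those assumptions, not the paper's implementation.

```python
# Sketch of a depth-adaptive Transformer layer driven by an adaptive ODE solver.
# Assumes the torchdiffeq package (pip install torchdiffeq).
import torch
import torch.nn as nn
from torchdiffeq import odeint


class TransformerDynamics(nn.Module):
    """Uses one Transformer encoder layer to define dh/dt = F(h).
    Subtracting h isolates the layer's residual contribution as the vector field."""
    def __init__(self, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.nfe = 0  # number of function evaluations = effective depth

    def forward(self, t, h):
        self.nfe += 1
        return self.layer(h) - h


if __name__ == "__main__":
    dyn = TransformerDynamics().eval()
    h0 = torch.randn(2, 10, 256)                       # (batch, length, d_model)
    t = torch.tensor([0.0, 1.0])
    with torch.no_grad():
        h1 = odeint(dyn, h0, t, method="dopri5")[-1]   # adaptive Dormand-Prince
    print(h1.shape, "function evaluations:", dyn.nfe)
```

An adaptive solver such as dopri5 chooses its internal step sizes from local error estimates, so inputs whose hidden states change quickly trigger more evaluations of the layer than inputs that evolve smoothly.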
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences.