I3D: Transformer architectures with input-dependent dynamic depth for
speech recognition
- URL: http://arxiv.org/abs/2303.07624v1
- Date: Tue, 14 Mar 2023 04:47:00 GMT
- Title: I3D: Transformer architectures with input-dependent dynamic depth for
speech recognition
- Authors: Yifan Peng, Jaesong Lee, Shinji Watanabe
- Abstract summary: We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs.
We also present interesting analysis on the gate probabilities and the input-dependency, which helps us better understand deep encoders.
- Score: 41.35563331283372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based end-to-end speech recognition has achieved great success.
However, the large footprint and computational overhead make it difficult to
deploy these models in some real-world applications. Model compression
techniques can reduce the model size and speed up inference, but the compressed
model has a fixed architecture which might be suboptimal. We propose a novel
Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong
performance-efficiency trade-offs. With a similar number of layers at inference
time, I3D-based models outperform the vanilla Transformer and the static pruned
model via iterative layer pruning. We also present interesting analysis on the
gate probabilities and the input-dependency, which helps us better understand
deep encoders.
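The abstract does not spell out how the depth is made input-dependent, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a hypothetical gate predictor that mean-pools the input frames and emits one probability per encoder layer; a layer whose probability falls below a threshold is skipped, so only the residual path is kept. The names `make_layer`, `gate_probabilities`, `dynamic_depth_encode`, and the threshold of 0.5 are all illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(scale):
    """Hypothetical stand-in for the residual branch of one encoder layer."""
    def layer(frames):
        return [[math.tanh(scale * v) for v in frame] for frame in frames]
    return layer


def gate_probabilities(frames, gate_weights, gate_biases):
    """Assumed gate predictor: mean-pool over frames, then one logistic unit per layer."""
    dim = len(frames[0])
    pooled = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [
        sigmoid(sum(w * p for w, p in zip(ws, pooled)) + b)
        for ws, b in zip(gate_weights, gate_biases)
    ]


def dynamic_depth_encode(frames, layers, gate_weights, gate_biases, threshold=0.5):
    """Run only the layers whose gate probability reaches the threshold."""
    probs = gate_probabilities(frames, gate_weights, gate_biases)
    executed = []
    for idx, (layer, p) in enumerate(zip(layers, probs)):
        if p < threshold:
            continue  # skipped layer: the input passes through unchanged
        update = layer(frames)
        # Residual connection: add the layer's update to its input.
        frames = [[x + u for x, u in zip(f, uf)] for f, uf in zip(frames, update)]
        executed.append(idx)
    return frames, executed, probs


if __name__ == "__main__":
    frames = [[0.1, -0.2], [0.3, 0.4]]           # two frames, two features (toy data)
    layers = [make_layer(1.0), make_layer(0.5), make_layer(2.0)]
    weights = [[0.0, 0.0]] * 3                   # toy gate weights, not learned
    biases = [-10.0, 10.0, -10.0]                # forces only the middle layer open
    _, executed, probs = dynamic_depth_encode(frames, layers, weights, biases)
    print("executed layers:", executed)
    print("gate probabilities:", probs)
```

In this sketch the number of executed layers varies with the input through the pooled features, whereas a statically pruned model would run the same fixed subset of layers for every utterance.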
Related papers
- latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction [48.86083272054711]
latentSplat is a method to predict semantic Gaussians in a 3D latent space that can be splatted and decoded by a light-weight generative 2D architecture.
We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.
arXiv Detail & Related papers (2024-03-24T20:48:36Z)
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
- Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability [118.26563926533517]
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space.
We extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously.
arXiv Detail & Related papers (2024-02-19T15:33:09Z)
- Knowledge Distillation in Vision Transformers: A Critical Review [6.508088032296086]
Vision Transformers (ViTs) have demonstrated impressive performance improvements over Convolutional Neural Networks (CNNs).
Model compression has recently attracted considerable research attention as a potential remedy.
This paper discusses various approaches based upon KD for effective compression of ViT models.
arXiv Detail & Related papers (2023-02-04T06:30:57Z)
- Real-Time Target Sound Extraction [13.526450617545537]
We present the first neural network model to achieve real-time and streaming target sound extraction.
We propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder.
arXiv Detail & Related papers (2022-11-04T03:51:23Z)
- Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [17.22112222736234]
Transformer encoder architectures have recently achieved state-of-the-art results on monocular 3D human mesh reconstruction.
Due to the large memory overhead and slow inference speed, it is difficult to deploy such models for practical use.
We propose a novel transformer encoder-decoder architecture for 3D human mesh reconstruction from a single image, called FastMETRO.
arXiv Detail & Related papers (2022-07-27T22:54:09Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- Hierarchical Transformers Are More Efficient Language Models [19.061388006885686]
Transformer models yield impressive results on many NLP and sequence modeling tasks.
Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs.
We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences.
arXiv Detail & Related papers (2021-10-26T14:00:49Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics.
Recent neural implicit modeling methods show promising results on synthetic or dense datasets.
However, they perform poorly on real-world data that is sparse and noisy.
This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.