I3D: Transformer architectures with input-dependent dynamic depth for
speech recognition
- URL: http://arxiv.org/abs/2303.07624v1
- Date: Tue, 14 Mar 2023 04:47:00 GMT
- Title: I3D: Transformer architectures with input-dependent dynamic depth for
speech recognition
- Authors: Yifan Peng, Jaesong Lee, Shinji Watanabe
- Abstract summary: We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs.
We also present interesting analysis on the gate probabilities and the input-dependency, which helps us better understand deep encoders.
- Score: 41.35563331283372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based end-to-end speech recognition has achieved great success.
However, the large footprint and computational overhead make it difficult to
deploy these models in some real-world applications. Model compression
techniques can reduce the model size and speed up inference, but the compressed
model has a fixed architecture which might be suboptimal. We propose a novel
Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong
performance-efficiency trade-offs. With a similar number of layers at inference
time, I3D-based models outperform the vanilla Transformer and the static pruned
model via iterative layer pruning. We also present interesting analysis on the
gate probabilities and the input-dependency, which helps us better understand
deep encoders.
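The abstract's idea of input-dependent depth can be illustrated with a toy sketch: a gate predictor maps each input to one keep-probability per layer, and layers whose probability falls below a threshold are skipped at inference. This is not the authors' implementation. The names `gate_probabilities` and `run_dynamic_depth`, the mean-pooled logistic gate, the 0.5 threshold, and the scalar toy layers are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_probabilities(
    features: Sequence[float],
    weights: Sequence[float],
    biases: Sequence[float],
) -> List[float]:
    """Hypothetical gate predictor: one keep-probability per layer,
    computed from the mean of the input features."""
    mean = sum(features) / len(features)
    return [sigmoid(w * mean + b) for w, b in zip(weights, biases)]


def run_dynamic_depth(
    features: Sequence[float],
    layers: Sequence[Layer],
    weights: Sequence[float],
    biases: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[Vector, List[int]]:
    """Run only the layers whose gate probability reaches the threshold.

    Returns the final hidden vector and the indices of executed layers.
    """
    probs = gate_probabilities(features, weights, biases)
    hidden = list(features)
    executed: List[int] = []
    for index, (layer, prob) in enumerate(zip(layers, probs)):
        if prob >= threshold:
            # Residual update, so a skipped layer acts as the identity.
            update = layer(hidden)
            hidden = [h + u for h, u in zip(hidden, update)]
            executed.append(index)
    return hidden, executed


# Toy stand-ins for encoder layers: each one scales the hidden vector.
toy_layers: List[Layer] = [
    lambda h: [0.1 * x for x in h],
    lambda h: [0.2 * x for x in h],
    lambda h: [0.3 * x for x in h],
]
toy_weights = [1.0, -1.0, 1.0]
toy_biases = [0.0, 0.0, 0.0]

_, used_for_positive = run_dynamic_depth([2.0, 2.0], toy_layers, toy_weights, toy_biases)
_, used_for_negative = run_dynamic_depth([-2.0, -2.0], toy_layers, toy_weights, toy_biases)
print(used_for_positive)  # [0, 2]
print(used_for_negative)  # [1]
```

Different inputs activate different subsets of layers, so the effective depth varies per input while the full set of layers stays available.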
Related papers
- Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
arXiv Detail & Related papers (2025-12-14T20:02:43Z) - Particulate: Feed-Forward 3D Object Articulation [89.78788418174946]
Particulate is a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate lifts the network's feed-forward prediction to the input mesh, yielding a fully articulated 3D model in seconds.
arXiv Detail & Related papers (2025-12-12T18:59:51Z) - ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation [44.75113949778924]
ARTDECO is a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. We show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization.
arXiv Detail & Related papers (2025-10-09T17:57:38Z) - STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer [72.88105562624838]
We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments.
arXiv Detail & Related papers (2025-08-14T17:58:05Z) - latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction [48.86083272054711]
latentSplat is a method to predict semantic Gaussians in a 3D latent space that can be splatted and decoded by a light-weight generative 2D architecture.
We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.
arXiv Detail & Related papers (2024-03-24T20:48:36Z) - Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z) - Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability [118.26563926533517]
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space.
We extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously.
arXiv Detail & Related papers (2024-02-19T15:33:09Z) - Knowledge Distillation in Vision Transformers: A Critical Review [6.508088032296086]
Vision Transformers (ViTs) have demonstrated impressive performance improvements over Convolutional Neural Networks (CNNs).
Model compression has recently attracted considerable research attention as a potential remedy.
This paper discusses various approaches based upon KD for effective compression of ViT models.
arXiv Detail & Related papers (2023-02-04T06:30:57Z) - Real-Time Target Sound Extraction [13.526450617545537]
We present the first neural network model to achieve real-time and streaming target sound extraction.
We propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder.
arXiv Detail & Related papers (2022-11-04T03:51:23Z) - Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [17.22112222736234]
Transformer encoder architectures have recently achieved state-of-the-art results on monocular 3D human mesh reconstruction.
Due to the large memory overhead and slow inference speed, it is difficult to deploy such models for practical use.
We propose a novel transformer encoder-decoder architecture for 3D human mesh reconstruction from a single image, called FastMETRO.
arXiv Detail & Related papers (2022-07-27T22:54:09Z) - MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z) - Hierarchical Transformers Are More Efficient Language Models [19.061388006885686]
Transformer models yield impressive results on many NLP and sequence modeling tasks.
Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs.
We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences.
arXiv Detail & Related papers (2021-10-26T14:00:49Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics.
Recent neural implicit modeling methods show promising results on synthetic or dense datasets.
But they perform poorly on real-world data that is sparse and noisy.
This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.