SuperShaper: Task-Agnostic Super Pre-training of BERT Models with
Variable Hidden Dimensions
- URL: http://arxiv.org/abs/2110.04711v1
- Date: Sun, 10 Oct 2021 05:44:02 GMT
- Title: SuperShaper: Task-Agnostic Super Pre-training of BERT Models with
Variable Hidden Dimensions
- Authors: Vinod Ganesan, Gowtham Ramesh, Pratyush Kumar
- Abstract summary: SuperShaper is a task-agnostic pre-training approach for NLU models.
It simultaneously pre-trains a large number of Transformer models by varying their shapes.
SuperShaper discovers networks that effectively trade off accuracy and model size.
- Score: 2.8583189395674653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Task-agnostic pre-training followed by task-specific fine-tuning is a default
approach to train NLU models. Such models need to be deployed on devices across
the cloud and the edge with varying resource and accuracy constraints. For a
given task, repeating pre-training and fine-tuning across tens of devices is
prohibitively expensive. We propose SuperShaper, a task-agnostic pre-training
approach which simultaneously pre-trains a large number of Transformer models
by varying shapes, i.e., by varying the hidden dimensions across layers. This
is enabled by a backbone network with linear bottleneck matrices around each
Transformer layer which are sliced to generate differently shaped sub-networks.
In spite of its simple design space and efficient implementation, SuperShaper
discovers networks that effectively trade off accuracy and model size:
Discovered networks are more accurate than a range of hand-crafted and
automatically searched networks on GLUE benchmarks. Further, we find two
critical advantages of shape as a design variable for Neural Architecture
Search (NAS): (a) heuristics of good shapes can be derived and networks found
with these heuristics match and even improve on carefully searched networks
across a range of parameter counts, and (b) the latency of networks across
multiple CPUs and GPUs is insensitive to the shape and thus enables
device-agnostic search. In summary, SuperShaper radically simplifies NAS for
language models and discovers networks that generalize across tasks, parameter
constraints, and devices.
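A minimal sketch of the slicing idea described in the abstract may help make it concrete. The snippet below is a hypothetical PyTorch illustration, not the authors' implementation: a backbone keeps full-size linear bottleneck matrices around every layer, and a sub-network of a given shape is obtained by slicing those matrices at forward time. The names (SliceLinear, ShapedBackbone, the candidate widths) are assumptions, and a simple feed-forward block stands in for a full attention-plus-FFN Transformer layer.

```python
import random
import torch
import torch.nn as nn


class SliceLinear(nn.Module):
    """Linear layer whose input/output widths can be sliced at forward time."""

    def __init__(self, max_in: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, in_dim: int, out_dim: int) -> torch.Tensor:
        # Use only the top-left (out_dim, in_dim) block of the full weight matrix.
        w = self.weight[:out_dim, :in_dim]
        return x @ w.t() + self.bias[:out_dim]


class ShapedBackbone(nn.Module):
    """Stack of layers, each wrapped by sliceable down/up bottleneck projections."""

    def __init__(self, emb_dim: int = 768, max_hidden: int = 768, n_layers: int = 12):
        super().__init__()
        self.emb_dim, self.n_layers = emb_dim, n_layers
        self.down = nn.ModuleList([SliceLinear(emb_dim, max_hidden) for _ in range(n_layers)])
        self.body = nn.ModuleList([SliceLinear(max_hidden, max_hidden) for _ in range(n_layers)])
        self.up = nn.ModuleList([SliceLinear(max_hidden, emb_dim) for _ in range(n_layers)])

    def forward(self, x: torch.Tensor, shape: list) -> torch.Tensor:
        # `shape` lists the hidden dimension used by each layer of this sub-network.
        for down, body, up, h in zip(self.down, self.body, self.up, shape):
            z = down(x, self.emb_dim, h)        # project backbone width -> h
            z = torch.relu(body(z, h, h))       # stand-in for the Transformer layer at width h
            x = x + up(z, h, self.emb_dim)      # project h -> backbone width, add residual
        return x


# Super pre-training would sample a different shape per step; here is one sampled shape.
backbone = ShapedBackbone()
tokens = torch.randn(2, 16, 768)                    # (batch, sequence, emb_dim)
widths = [120, 240, 360, 480, 600, 768]             # illustrative candidate widths, not the paper's exact search space
shape = [random.choice(widths) for _ in range(backbone.n_layers)]
output = backbone(tokens, shape)
print(output.shape)                                 # torch.Size([2, 16, 768])
```

Because every sampled shape reads from the same underlying parameters, pre-training over randomly sampled shapes trains many differently shaped sub-networks at once, which is the weight-sharing property the abstract attributes to SuperShaper.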
Related papers
- Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method by optimizing the sparse structure of a randomly initialized network at each iteration and tweaking unimportant weights on-the-fly by a small amount proportional to the magnitude scale.
arXiv Detail & Related papers (2023-03-16T21:06:13Z)
- On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), termed cross-mesh resharding.
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z)
- SVNet: Where SO(3) Equivariance Meets Binarization on Point Cloud Representation [65.4396959244269]
The paper tackles the challenge of combining SO(3) equivariance with network binarization by designing a general framework to construct 3D learning architectures.
The proposed approach can be applied to general backbones like PointNet and DGCNN.
Experiments on ModelNet40, ShapeNet, and the real-world dataset ScanObjectNN demonstrate that the method achieves a good trade-off among efficiency, rotation robustness, and accuracy.
arXiv Detail & Related papers (2022-09-13T12:12:19Z)
- Searching for Efficient Multi-Stage Vision Transformers [42.0565109812926]
Vision Transformer (ViT) demonstrates that the Transformer architecture developed for natural language processing can be applied to computer vision tasks.
ViT-ResNAS is an efficient multi-stage ViT architecture designed with neural architecture search (NAS).
arXiv Detail & Related papers (2021-09-01T22:37:56Z)
- Pre-Trained Models for Heterogeneous Information Networks [57.78194356302626]
We propose a self-supervised pre-training and fine-tuning framework, PF-HIN, to capture the features of a heterogeneous information network.
PF-HIN consistently and significantly outperforms state-of-the-art alternatives on a range of downstream tasks across four datasets.
arXiv Detail & Related papers (2020-07-07T03:36:28Z)
- FNA++: Fast Network Adaptation via Parameter Remapping and Architecture Search [35.61441231491448]
We propose a Fast Network Adaptation (FNA++) method, which can adapt both the architecture and parameters of a seed network.
In our experiments, we apply FNA++ on MobileNetV2 to obtain new networks for semantic segmentation, object detection, and human pose estimation.
The total computation cost of FNA++ is significantly less than that of SOTA segmentation and detection NAS approaches.
arXiv Detail & Related papers (2020-06-21T10:03:34Z)
- Ensembled sparse-input hierarchical networks for high-dimensional datasets [8.629912408966145]
We show that dense neural networks can be a practical data analysis tool in settings with small sample sizes.
The proposed method, EASIER-net, appropriately prunes the network structure by tuning only two L1-penalty parameters.
On a collection of real-world datasets with different sizes, EASIER-net selected network architectures in a data-adaptive manner and achieved higher prediction accuracy than off-the-shelf methods on average.
arXiv Detail & Related papers (2020-05-11T02:08:53Z)
- Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio [101.84651388520584]
This paper presents a new framework named network adjustment, which considers network accuracy as a function of FLOPs.
Experiments on standard image classification datasets and a wide range of base networks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-06T15:51:00Z)
- DHP: Differentiable Meta Pruning via HyperNetworks [158.69345612783198]
This paper introduces a differentiable pruning method via hypernetworks for automatic network pruning.
Latent vectors control the output channels of the convolutional layers in the backbone network and act as a handle for the pruning of the layers.
Experiments are conducted on various networks for image classification, single image super-resolution, and denoising.
arXiv Detail & Related papers (2020-03-30T17:59:18Z)
- Fast Neural Network Adaptation via Parameter Remapping and Architecture Search [35.61441231491448]
Deep neural networks achieve remarkable performance in many computer vision tasks.
Most state-of-the-art (SOTA) semantic segmentation and object detection approaches reuse neural network architectures designed for image classification as the backbone.
One major challenge, though, is that ImageNet pre-training of the search space representation incurs a huge computational cost.
In this paper, we propose a Fast Neural Network Adaptation (FNA) method, which can adapt both the architecture and parameters of a seed network.
arXiv Detail & Related papers (2020-01-08T13:45:15Z)