Accelerating Natural Language Understanding in Task-Oriented Dialog
- URL: http://arxiv.org/abs/2006.03701v1
- Date: Fri, 5 Jun 2020 21:36:33 GMT
- Title: Accelerating Natural Language Understanding in Task-Oriented Dialog
- Authors: Ojas Ahuja and Shrey Desai
- Abstract summary: We show that a simple convolutional model compressed with structured pruning achieves largely comparable results to BERT on ATIS and Snips, with under 100K parameters.
We also perform acceleration experiments on CPUs, where we observe our multi-task model predicts intents and slots nearly 63x faster than even DistilBERT.
- Score: 6.757982879080109
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Task-oriented dialog models typically leverage complex neural architectures
and large-scale, pre-trained Transformers to achieve state-of-the-art
performance on popular natural language understanding benchmarks. However,
these models frequently have in excess of tens of millions of parameters,
making them impossible to deploy on-device where resource-efficiency is a major
concern. In this work, we show that a simple convolutional model compressed
with structured pruning achieves largely comparable results to BERT on ATIS and
Snips, with under 100K parameters. Moreover, we perform acceleration
experiments on CPUs, where we observe our multi-task model predicts intents and
slots nearly 63x faster than even DistilBERT.
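The abstract does not include an implementation, but the core recipe (a small multi-task convolutional encoder with an utterance-level intent head and a token-level slot head, then compressed with structured pruning) can be sketched roughly as follows. All layer sizes, label counts, and the use of PyTorch's built-in ln_structured utility are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


class MultiTaskConvNLU(nn.Module):
    """Shared convolutional encoder with an intent head and a slot head.

    Sizes below are placeholders, not the paper's configuration.
    """

    def __init__(self, vocab_size=10_000, embed_dim=64, hidden_dim=64,
                 num_intents=21, num_slots=120):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv1 = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1)
        self.intent_head = nn.Linear(hidden_dim, num_intents)  # utterance-level
        self.slot_head = nn.Linear(hidden_dim, num_slots)      # token-level

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        h = torch.relu(self.conv2(torch.relu(self.conv1(x))))
        intent_logits = self.intent_head(h.mean(dim=2))   # pool over time
        slot_logits = self.slot_head(h.transpose(1, 2))   # one prediction per token
        return intent_logits, slot_logits


model = MultiTaskConvNLU()

# Structured pruning illustration: zero out half of each conv layer's output
# channels by L2 norm (PyTorch's ln_structured; not necessarily the paper's scheme).
for conv in (model.conv1, model.conv2):
    prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)
    prune.remove(conv, "weight")  # bake the mask into the weights

intent_logits, slot_logits = model(torch.randint(1, 10_000, (2, 12)))
```

Structured pruning zeroes out whole channels, which can then be physically removed to shrink the model; unlike unstructured sparsity, the result needs no special sparse kernels to run fast on CPUs.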
Related papers
- Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z) - Structural Pruning of Pre-trained Language Models via Neural Architecture Search [7.833790713816726]
Pre-trained language models (PLMs) mark the state of the art for natural language understanding tasks when fine-tuned on labeled data.
This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade off efficiency against generalization performance.
arXiv Detail & Related papers (2024-05-03T17:34:57Z) - Fast DistilBERT on CPUs [13.29188219884869]
Transformer-based language models have become the standard approach to solving natural language processing tasks.
Industry adoption usually requires maximizing throughput while complying with certain latency constraints.
We propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators.
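The pipeline and runtime engine in that paper are custom, but one of its ingredients, INT8 dynamic quantization of a DistilBERT checkpoint for CPU inference, can be approximated with stock PyTorch and Hugging Face transformers. The checkpoint name and example input below are assumptions chosen for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example base checkpoint; any DistilBERT-style encoder would do here.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# Convert all Linear layers to INT8 weights with run-time activation scales.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("book a flight from denver to boston", return_tensors="pt")
with torch.no_grad():
    hidden_states = quantized(**inputs).last_hidden_state  # (1, seq_len, 768)
```

Dynamic quantization stores Linear weights in INT8 and computes activation scales at run time, which typically yields a CPU speedup without retraining, at a small accuracy cost.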
arXiv Detail & Related papers (2022-10-27T07:22:50Z) - Sparse*BERT: Sparse Models Generalize To New tasks and Domains [79.42527716035879]
This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks.
We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text.
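For reference, the gradual unstructured magnitude pruning that Sparse*BERT builds on can be sketched with PyTorch's pruning utilities; the stand-in two-layer model, the linear sparsity ramp, and the 90% target below are illustrative assumptions rather than the paper's recipe.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a Transformer feed-forward block; the real target would be BERT.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
parameters_to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]

final_sparsity, steps = 0.9, 10
pruned_so_far = 0.0
for step in range(1, steps + 1):
    target = final_sparsity * step / steps  # cumulative sparsity goal this step
    # global_unstructured interprets `amount` relative to the still-unpruned
    # weights, so convert the cumulative target into an incremental fraction.
    increment = (target - pruned_so_far) / (1.0 - pruned_so_far)
    prune.global_unstructured(
        parameters_to_prune,
        pruning_method=prune.L1Unstructured,  # magnitude (L1) criterion
        amount=increment,
    )
    pruned_so_far = target
    # ... a few epochs of training would go here so the dense weights adapt ...

# Bake the final masks into the weight tensors.
for module, name in parameters_to_prune:
    prune.remove(module, name)
```

Interleaving small pruning steps with further training lets the surviving weights compensate before the next round of pruning.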
arXiv Detail & Related papers (2022-05-25T02:51:12Z) - HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning [14.412066456583917]
We propose a transformer-based model for few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples.
Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal.
We extend our approach to a semi-supervised regime utilizing unlabeled samples in the support set and further improving few-shot performance.
arXiv Detail & Related papers (2022-01-11T20:15:35Z) - AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants at different model capacities on various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z) - Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning [52.624194343095304]
We argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions.
We empirically show that common pre-trained models have a very low intrinsic dimension.
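Intrinsic dimension is usually estimated by fine-tuning inside a fixed random low-dimensional subspace and finding the smallest subspace dimension that recovers most of full fine-tuning accuracy. A minimal single-layer sketch of that reparameterization is given below; the dimensions, the dense random projection, and the per-layer scope are simplifying assumptions (in practice the projection is shared across parameters and implemented with structured transforms such as Fastfood).

```python
import torch
import torch.nn as nn


class SubspaceLinear(nn.Module):
    """Linear layer whose weight update lives in a d-dimensional random subspace."""

    def __init__(self, in_features, out_features, subspace_dim=200):
        super().__init__()
        base = nn.Linear(in_features, out_features)
        self.register_buffer("w0", base.weight.detach().clone())  # frozen init
        self.register_buffer("b0", base.bias.detach().clone())
        # Fixed (non-trainable) random projection from the subspace to weight space.
        proj = torch.randn(subspace_dim, out_features * in_features)
        self.register_buffer("proj", proj / proj.norm(dim=1, keepdim=True))
        self.theta = nn.Parameter(torch.zeros(subspace_dim))  # only trainable part

    def forward(self, x):
        delta = (self.theta @ self.proj).view_as(self.w0)
        return nn.functional.linear(x, self.w0 + delta, self.b0)


layer = SubspaceLinear(768, 768)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 200
```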
arXiv Detail & Related papers (2020-12-22T07:42:30Z) - Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best BERT model structure for a given computation size, matched to specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z) - Simplified Self-Attention for Transformer-based End-to-End Speech Recognition [56.818507476125895]
We propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form query and key vectors.
We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks.
arXiv Detail & Related papers (2020-05-21T04:55:59Z)