Learning Multiscale Transformer Models for Sequence Generation
- URL: http://arxiv.org/abs/2206.09337v1
- Date: Sun, 19 Jun 2022 07:28:54 GMT
- Title: Learning Multiscale Transformer Models for Sequence Generation
- Authors: Bei Li, Tong Zheng, Yi Jing, Chengbo Jiao, Tong Xiao and Jingbo Zhu
- Abstract summary: We build a multiscale Transformer model by establishing relationships among scales based on word-boundary information and phrase-level prior knowledge.
Notably, it yielded consistent performance gains over the strong baseline on several test sets without sacrificing efficiency.
- Score: 33.73729074207944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiscale feature hierarchies have witnessed success in the
computer vision area. This further motivates researchers to design multiscale
Transformers for natural language processing, mostly based on the
self-attention mechanism, for example by restricting the receptive field
across heads or extracting local fine-grained features via convolutions.
However, most existing works directly model local features but ignore
word-boundary information, which results in redundant and ambiguous attention
distributions that lack interpretability. In this work, we define those
scales in terms of different linguistic units, including sub-words, words and
phrases. We build a multiscale Transformer model by establishing
relationships among scales based on word-boundary information and
phrase-level prior knowledge. The proposed \textbf{U}niversal
\textbf{M}ulti\textbf{S}cale \textbf{T}ransformer, namely \textsc{Umst}, was
evaluated on two sequence generation tasks. Notably, it yielded consistent
performance gains over the strong baseline on several test sets without
sacrificing efficiency.
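To make the scale construction concrete, below is a minimal illustrative sketch, not the authors' implementation, of how word-boundary information can relate sub-word tokens to coarser word- and phrase-level scales. It assumes SentencePiece-style tokens in which "▁" marks the start of a word, and it uses a fixed-size window of words as a crude stand-in for the phrase-level prior knowledge mentioned in the abstract; all function names are made up for the example.

```python
# Illustrative sketch only: building word- and phrase-scale attention masks
# from sub-word boundary information. Assumptions: "▁" marks a word start
# (SentencePiece-style); a fixed window of words stands in for phrase priors.
import numpy as np

def word_ids_from_subwords(subwords):
    """Map each sub-word token to the index of the word it belongs to."""
    word_ids, current = [], -1
    for tok in subwords:
        if tok.startswith("▁"):            # word boundary: a new word begins
            current += 1
        word_ids.append(max(current, 0))
    return np.array(word_ids)

def scale_masks(word_ids, phrase_size=2):
    """Boolean attention masks for two coarser scales: positions may attend
    within the same word (fine scale) or within the same fixed-size phrase
    window of words (coarse scale)."""
    phrase_ids = word_ids // phrase_size
    same_word = word_ids[:, None] == word_ids[None, :]
    same_phrase = phrase_ids[:, None] == phrase_ids[None, :]
    return {"word": same_word, "phrase": same_phrase}

if __name__ == "__main__":
    toks = ["▁multi", "scale", "▁transform", "ers", "▁work", "▁well"]
    wid = word_ids_from_subwords(toks)      # -> [0 0 1 1 2 3]
    masks = scale_masks(wid, phrase_size=2)
    print(wid)
    print(masks["word"].astype(int))        # sub-words of the same word
    print(masks["phrase"].astype(int))      # words grouped into phrase windows
```

Masks of this kind could restrict self-attention so that fine-grained heads operate within words while coarser heads operate within phrases, which is the flavor of boundary-aware locality the abstract describes.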
Related papers
- Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies [12.843274390224853]
Real-world tasks, like multimodal translation, often require a combination of strengths from different models, such as handling both translation and image processing.
We propose a novel zero-shot ensembling strategy that allows for the integration of different models during the decoding phase without the need for additional training.
Our approach re-ranks beams during decoding by combining scores at the word level, using multimodal models to predict when a word is completed; an illustrative sketch of this word-level score combination appears after the list of related papers below.
arXiv Detail & Related papers (2024-08-21T04:20:55Z) - Investigating semantic subspaces of Transformer sentence embeddings
through linear structural probing [2.5002227227256864]
We present experiments with semantic structural probing, a method for studying sentence-level representations.
We apply our method to language models from different families (encoder-only, decoder-only, encoder-decoder) and of different sizes in the context of two tasks.
We find that model families differ substantially in their performance and layer dynamics, but that the results are largely model-size invariant.
arXiv Detail & Related papers (2023-10-18T12:32:07Z) - MGDoc: Pre-training with Multi-granular Hierarchy for Document Image
Understanding [53.03978356918377]
Spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks.
Existing methods learn features from either word-level or region-level but fail to consider both simultaneously.
We propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes page-level, region-level, and word-level information at the same time.
arXiv Detail & Related papers (2022-11-27T22:47:37Z) - Pre-Training a Graph Recurrent Network for Language Representation [34.4554387894105]
We consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications.
We find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
arXiv Detail & Related papers (2022-09-08T14:12:15Z) - Multilingual Transformer Encoders: a Word-Level Task-Agnostic Evaluation [0.6882042556551609]
Some Transformer-based models can perform cross-lingual transfer learning.
We propose a word-level task-agnostic method to evaluate the alignment of contextualized representations built by such models.
arXiv Detail & Related papers (2022-07-19T05:23:18Z) - Exploring Dimensionality Reduction Techniques in Multilingual
Transformers [64.78260098263489]
This paper gives a comprehensive account of the impact of dimensionality reduction techniques on the performance of state-of-the-art multilingual Siamese Transformers.
It shows that it is possible to achieve an average reduction in the number of dimensions of $91.58\% \pm 2.59\%$ and $54.65\% \pm 32.20\%$, respectively.
arXiv Detail & Related papers (2022-04-18T17:20:55Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language
Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It effectively avoids degenerating into predicting masked words conditioned only on context in the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Multiple Word Embeddings for Increased Diversity of Representation [15.279850826041066]
We show a technique that substantially and consistently improves performance over a strong baseline with negligible increase in run time.
We analyze aspects of pre-trained embedding similarity and vocabulary coverage and find that the representational diversity is the driving force of why this technique works.
arXiv Detail & Related papers (2020-09-30T02:33:09Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT-inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
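As referenced in the "Plug, Play, and Fuse" entry above, the following is a minimal illustrative sketch, not the authors' implementation, of combining word-level scores from several models to re-rank beam candidates across diverse vocabularies. The `word_logprob` interface, the candidate format, and the uniform weights are assumptions made for this example.

```python
# Illustrative sketch only: word-level score combination for re-ranking beam
# candidates across models. Each model is assumed to expose a hypothetical
# word_logprob(prefix_words, word) method that scores a completed word under
# its own tokenization; the uniform weights below are likewise assumptions.
from math import fsum

def rerank_beams(candidates, models, weights=None):
    """candidates: list of (prefix_words, next_word) pairs proposed by a
    primary model's beam search. Returns the candidates sorted by the
    combined word-level score of all models (higher is better)."""
    weights = weights or [1.0 / len(models)] * len(models)

    def combined_score(prefix, word):
        # Weighted sum of each model's log-probability for the completed word.
        return fsum(w * m.word_logprob(prefix, word)
                    for w, m in zip(weights, models))

    return sorted(candidates,
                  key=lambda c: combined_score(c[0], c[1]),
                  reverse=True)
```

Because each model scores only completed words under its own tokenization, a combination at the word level can span models whose sub-word vocabularies differ, which is the zero-shot ensembling setting that entry describes.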
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.