Pre-training Universal Language Representation
- URL: http://arxiv.org/abs/2105.14478v1
- Date: Sun, 30 May 2021 09:29:01 GMT
- Title: Pre-training Universal Language Representation
- Authors: Yian Li, Hai Zhao
- Abstract summary: This work introduces universal language representation learning, i.e., embeddings of different levels of linguistic units or text with quite diverse lengths in a uniform vector space.
We empirically verify that well designed pre-training scheme may effectively yield universal language representation.
- Score: 46.51685959045527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the well-developed cut-edge representation learning for language,
most language representation models usually focus on specific levels of
linguistic units. This work introduces universal language representation
learning, i.e., embeddings of different levels of linguistic units or text with
quite diverse lengths in a uniform vector space. We propose the training
objective MiSAD that utilizes meaningful n-grams extracted from large unlabeled
corpus by a simple but effective algorithm for pre-trained language models.
Then we empirically verify that well designed pre-training scheme may
effectively yield universal language representation, which will bring great
convenience when handling multiple layers of linguistic objects in a unified
way. Especially, our model achieves the highest accuracy on analogy tasks in
different language levels and significantly improves the performance on
downstream tasks in the GLUE benchmark and a question answering dataset.
Related papers
- Accelerating Multilingual Language Model for Excessively Tokenized Languages [3.5570874721859016]
tokenizers in large language models (LLMs) often fragment a text into character or Unicode-level tokens in non-Roman alphabetic languages.
We introduce a simple yet effective framework to accelerate text generation in such languages.
arXiv Detail & Related papers (2024-01-19T12:26:57Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z) - Constrained Language Models Yield Few-Shot Semantic Parsers [73.50960967598654]
We explore the use of large pretrained language models as few-shot semantics.
The goal in semantic parsing is to generate a structured meaning representation given a natural language input.
We use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation.
arXiv Detail & Related papers (2021-04-18T08:13:06Z) - BURT: BERT-inspired Universal Representation from Learning Meaningful
Segment [46.51685959045527]
This work introduces and explores the universal representation learning, i.e., embeddings of different levels of linguistic unit in a uniform vector space.
We present a universal representation model, BURT, to encode different levels of linguistic unit into the same vector space.
Specifically, we extract and mask meaningful segments based on point-wise mutual information (PMI) to incorporate different granular objectives into the pre-training stage.
arXiv Detail & Related papers (2020-12-28T16:02:28Z) - Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores the universal representation learning, i.e., embeddings of different levels of linguistic unit in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models incorporated with appropriate training settings may effectively yield universal representation.
arXiv Detail & Related papers (2020-09-10T03:53:18Z) - Learning Spoken Language Representations with Neural Lattice Language
Modeling [39.50831917042577]
We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks.
The proposed two-stage pre-training approach reduces the demands of speech data and has better efficiency.
arXiv Detail & Related papers (2020-07-06T10:38:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.