Related papers: Character-level Chinese Backpack Language Models

Character-level Chinese Backpack Language Models

URL: http://arxiv.org/abs/2310.12751v1
Date: Thu, 19 Oct 2023 13:54:57 GMT
Title: Character-level Chinese Backpack Language Models
Authors: Hao Sun, John Hewitt
Abstract summary: We train, evaluate, interpret, and control Backpack language models in character-tokenized Chinese. We find that our (134M parameter) Chinese Backpack language model performs comparably to a (104M parameter) Transformer. We find that complex multi-character meanings are often formed by using the same per-character sense weights consistently across context.
Score: 19.329707412615047
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Backpack is a Transformer alternative shown to improve interpretability in English language modeling by decomposing predictions into a weighted sum of token sense components. However, Backpacks' reliance on token-defined meaning raises questions as to their potential for languages other than English, a language for which subword tokenization provides a reasonable approximation for lexical items. In this work, we train, evaluate, interpret, and control Backpack language models in character-tokenized Chinese, in which words are often composed of many characters. We find that our (134M parameter) Chinese Backpack language model performs comparably to a (104M parameter) Transformer, and learns rich character-level meanings that log-additively compose to form word meanings. In SimLex-style lexical semantic evaluations, simple averages of Backpack character senses outperform input embeddings from a Transformer. We find that complex multi-character meanings are often formed by using the same per-character sense weights consistently across context. Exploring interpretability-through control, we show that we can localize a source of gender bias in our Backpacks to specific character senses and intervene to reduce the bias.

Related papers

Backpack Language Models [108.65930795825416]
We present Backpacks, a new neural architecture that marries strong modeling performance with an interface for interpretability and control. We find that, after training, sense vectors specialize, each encoding a different aspect of a word. We present simple algorithms that intervene on sense vectors to perform controllable text generation and debiasing.
arXiv Detail & Related papers (2023-05-26T09:26:23Z)
Lexinvariant Language Models [84.2829117441298]
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM) We study textitlexinvariantlanguage models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice. We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
arXiv Detail & Related papers (2023-05-24T19:10:46Z)
Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models [12.0190584907439]
We propose a new method to exploit word structure and integrate lexical semantics into character representations of pre-trained models. We show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks.
arXiv Detail & Related papers (2022-07-13T02:28:08Z)
What do tokens know about their characters and how do they know it? [3.8254443661593633]
We show that pre-trained language models that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information. We show that these models robustly encode character-level information and, in general, larger models perform better at the task.
arXiv Detail & Related papers (2022-06-06T13:27:26Z)
SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models. We propose three kinds of tokenizers: SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character), the glyph-based tokenizers. We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT. We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers. We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching [29.318730227080675]
We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity. Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
arXiv Detail & Related papers (2021-02-25T04:01:51Z)
2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts. Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)
Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve the word order information for natural language processing tasks, generating fixed position indices for input sequences. Due to word order divergences in different languages, modeling the cross-lingual positional relationships might help SANs tackle this problem. We augment SANs with emphcross-lingual position representations to model the bilingually aware latent structure for the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.