NFLAT: Non-Flat-Lattice Transformer for Chinese Named Entity Recognition
- URL: http://arxiv.org/abs/2205.05832v1
- Date: Thu, 12 May 2022 01:55:37 GMT
- Title: NFLAT: Non-Flat-Lattice Transformer for Chinese Named Entity Recognition
- Authors: Shuang Wu, Xiaoning Song, Zhenhua Feng, Xiaojun Wu
- Abstract summary: We advocate a novel lexical enhancement method, InterFormer, that effectively reduces computational and memory costs.
Compared with FLAT, it avoids unnecessary "word-character" and "word-word" attention calculations.
This reduces memory usage by about 50% and allows more extensive lexicons or larger batch sizes to be used for network training.
- Score: 39.308634515653914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, FLAT has achieved great success in Chinese Named Entity Recognition
(NER). This method achieves lexical enhancement by constructing a flat lattice,
which mitigates the difficulties posed by blurred word boundaries and the lack
of word semantics. To this end, FLAT uses the position information of the
starting and ending characters to connect the matching words. However, this
method is likely to match more words when dealing with long texts, resulting in
very long input sequences, which increases the memory usage and computational
cost of self-attention. To deal with this issue, we advocate a novel lexical
enhancement method, InterFormer, which effectively reduces computational and
memory costs by constructing a non-flat lattice. Furthermore, we implement a
complete model, namely NFLAT, for the Chinese NER task. NFLAT decouples lexicon
fusion from context feature encoding. Compared with FLAT, it avoids unnecessary
"word-character" and "word-word" attention calculations. This reduces memory
usage by about 50% and allows more extensive lexicons or larger batch sizes to
be used for network training. The experimental
results obtained on several well-known benchmarks demonstrate the superiority
of the proposed method over the state-of-the-art character-word hybrid models.
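To make the contrast in the abstract concrete, the sketch below is purely illustrative and is not the authors' code: the toy lexicon, the example sentence, the hidden size d, and the match_words() helper are all assumptions. It shows why FLAT's flat lattice grows the self-attention sequence to characters plus matched words, while a decoupled "characters attend to words" step only needs a C x W attention table.

```python
# A minimal, illustrative sketch of the contrast described in the abstract.
# NOT the authors' code: the toy lexicon, example sentence, hidden size d,
# and the match_words() helper are all assumptions made here.
import numpy as np

LEXICON = {"南京", "南京市", "长江", "长江大桥", "大桥"}   # toy word list

def match_words(chars, lexicon, max_len=4):
    """Return (word, start, end) spans for every lexicon word found in the text."""
    spans = []
    for i in range(len(chars)):
        for j in range(i + 2, min(i + max_len, len(chars)) + 1):
            word = "".join(chars[i:j])
            if word in lexicon:
                spans.append((word, i, j - 1))
    return spans

chars = list("南京市长江大桥")
spans = match_words(chars, LEXICON)

# FLAT appends every matched word to the character sequence, so self-attention
# runs over a sequence of length C + W and costs O((C + W)^2).
flat_len = len(chars) + len(spans)

# The NFLAT idea, as summarised in the abstract: fuse lexicon information with
# a C x W cross-attention (characters query words), then leave context encoding
# to run over the C characters only, skipping word-word/word-character pairs.
C, W, d = len(chars), len(spans), 16
rng = np.random.default_rng(0)
char_h = rng.normal(size=(C, d))                 # stand-in character features
word_h = rng.normal(size=(W, d))                 # stand-in word features
scores = char_h @ word_h.T / np.sqrt(d)          # C x W, not (C + W) x (C + W)
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
fused = char_h + attn @ word_h                   # lexicon-enhanced characters

print(f"flat-lattice length: {flat_len}, cross-attention table: {scores.shape}")
```

Under this decoupling, the subsequent context encoder only needs self-attention over the C fused character representations, which is consistent with the memory savings the abstract reports.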
Related papers
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding (BPE) algorithm.
It can be used to train a high-quality tokenizer on a basic laptop (a minimal BPE merge loop is sketched after this list).
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of pre-trained large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging step, keeping memory usage efficient (a rough sketch of this chunk-and-merge idea appears after this list).
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and can even cause heavy degradation (an illustrative trimming sketch appears after this list).
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Learn Your Tokens: Word-Pooled Tokenization for Language Modeling [11.40976202290724]
Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic for combining tokens into longer strings.
Recent attempts to compress and limit context lengths with fixed-size convolutions are helpful but completely ignore the word boundary.
This paper considers an alternative 'learn your word' scheme which utilizes the word boundary to pool bytes/characters into word representations.
arXiv Detail & Related papers (2023-10-17T23:34:39Z)
- Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship Attribution [2.3429306644730854]
A variety of word-based stylistic markers have been successfully used in deep learning methods to address the intrinsic problem of authorship attribution.
The proposed method was experimentally evaluated against numerous state-of-the-art methods on the public corpora CCAT50, IMDb62, Blog50, and Twitter50.
arXiv Detail & Related papers (2023-06-26T11:35:47Z)
- Semantic Tokenizer for Enhanced Natural Language Processing [32.605667552915854]
We present a novel tokenizer that uses semantics to drive vocabulary construction.
The tokenizer more than doubles the number of wordforms represented in the vocabulary.
arXiv Detail & Related papers (2023-04-24T19:33:41Z)
- Efficient CNN with uncorrelated Bag of Features pooling [98.78384185493624]
Bag of Features (BoF) has recently been proposed to reduce the complexity of convolution layers.
We propose an approach that builds on top of BoF pooling to boost its efficiency by ensuring that the items of the learned dictionary are non-redundant.
The proposed strategy yields an efficient variant of BoF and further boosts its performance, without any additional parameters.
arXiv Detail & Related papers (2022-09-22T09:00:30Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach, called the hyperplane-based approach, for the automatic extraction of domain-specific words.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality [24.80654159288458]
We propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models like BERT.
Our module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation.
We show that incorporating our module to mBERT significantly improves the performance on the social media linguistic code-switching evaluation (LinCE) benchmark.
arXiv Detail & Related papers (2020-10-24T01:08:28Z)
- Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure [3.9435648520559177]
A segmentation-free word embedding model is proposed that collects an n-gram vocabulary via a novel unsupervised association measure called pointwise association with times information (PATI).
The proposed method leverages more latent information from the corpus and thus is able to collect more valid n-grams that have stronger cohesion as embedding targets in unsegmented language data, such as Chinese texts.
arXiv Detail & Related papers (2020-07-05T13:55:19Z)
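For the Batching BPE Tokenization Merges entry above, the sketch below is a textbook BPE merge loop, not the BatchBPE package itself; the toy word counts and merge budget are made up, and the paper's batching of merges is not reproduced here.

```python
# Illustrative only: a plain Byte Pair Encoding training loop.
from collections import Counter

def bpe_train(words, num_merges):
    # Represent each word as a tuple of symbols (initially its characters).
    vocab = Counter({tuple(w): c for w, c in words.items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, count in vocab.items():          # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + count
        vocab = Counter(merged)
    return merges

print(bpe_train({"lower": 5, "newest": 6, "widest": 3}, num_merges=4))
```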
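For the Hierarchical Context Merging (HOMER) entry, the sketch below is only a rough guess at the general divide-and-conquer shape described there (chunking, token reduction before each merge), not the paper's method; the toy importance scores, chunk_size, keep_ratio, and the pairwise merging rule are all assumptions.

```python
# Rough illustration of chunking + token reduction + pairwise merging.
def hierarchical_merge(tokens, scores, chunk_size=8, keep_ratio=0.5):
    chunks = [list(zip(tokens[i:i + chunk_size], scores[i:i + chunk_size]))
              for i in range(0, len(tokens), chunk_size)]
    while len(chunks) > 1:
        reduced = []
        for chunk in chunks:
            k = max(1, int(len(chunk) * keep_ratio))         # token reduction
            kept = sorted(chunk, key=lambda t: t[1], reverse=True)[:k]
            reduced.append([t for t in chunk if t in kept])  # keep original order
        # merge neighbouring chunks pairwise for the next level
        chunks = [reduced[i] + (reduced[i + 1] if i + 1 < len(reduced) else [])
                  for i in range(0, len(reduced), 2)]
    return [token for token, _ in chunks[0]]

tokens = [f"t{i}" for i in range(32)]
scores = [(i * 37) % 11 for i in range(32)]                  # toy scores
print(hierarchical_merge(tokens, scores))
```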
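For the BPE vocabulary trimming entry, the sketch below illustrates "replace rare subwords with their component subwords" with a simple frequency threshold and greedy re-segmentation; real BPE toolkits trim by other means, so the counts, threshold, and helper names are placeholders.

```python
# Illustrative only: toy vocabulary trimming via frequency threshold.
def trim_vocab(vocab_counts, min_freq):
    kept = {s for s, c in vocab_counts.items() if c >= min_freq or len(s) == 1}

    def resegment(subword):
        out, i = [], 0
        while i < len(subword):
            for j in range(len(subword), i, -1):     # greedy longest match
                if subword[i:j] in kept or j == i + 1:
                    out.append(subword[i:j])
                    i = j
                    break
        return out

    # map every subword to its post-trimming segmentation
    return {s: ([s] if s in kept else resegment(s)) for s in vocab_counts}

counts = {"inter": 50, "national": 40, "internat": 2, "ional": 30,
          "i": 999, "n": 999, "a": 999, "t": 999, "e": 999, "r": 999}
print(trim_vocab(counts, min_freq=5)["internat"])    # ['inter', 'n', 'a', 't']
```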