Maps Search Misspelling Detection Leveraging Domain-Augmented Contextual
Representations
- URL: http://arxiv.org/abs/2108.06842v1
- Date: Sun, 15 Aug 2021 23:59:12 GMT
- Title: Maps Search Misspelling Detection Leveraging Domain-Augmented Contextual
Representations
- Authors: Yutong Li
- Abstract summary: Building an independent misspelling detector and serving it before correction can bring multiple benefits to the speller and other search components.
With the rapid development of deep learning and substantial advances in contextual representation learning such as BERTology, building a decent misspelling detector without having to rely on the hand-crafted features associated with noisy-channel architectures has become more accessible than ever.
In this paper we design four stages of models for misspelling detection, ranging from the most basic LSTM to a single-domain augmented fine-tuned BERT.
- Score: 4.619541348328937
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Building an independent misspelling detector and serving it before
correction can bring multiple benefits to the speller and other search
components, which is particularly true for the most commonly deployed
noisy-channel based speller systems. With the rapid development of deep
learning and substantial advances in contextual representation learning such
as BERTology, building a decent misspelling detector without having to rely
on the hand-crafted features associated with noisy-channel architectures has
become more accessible than ever. However, BERTology models are trained on
natural-language corpora while Maps Search is highly domain specific; would
BERTology continue its success? In this paper we design four stages of models
for misspelling detection, ranging from the most basic LSTM to a
single-domain augmented fine-tuned BERT. We found that, for Maps Search in
our case, other advanced BERTology-family models such as RoBERTa do not
necessarily outperform BERT, and a classic cross-domain fine-tuned full BERT
even underperforms a smaller single-domain fine-tuned BERT. We share more
findings through comprehensive modeling experiments and analysis, and we also
briefly cover the breakthrough in our data generation algorithm.
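The listing gives no implementation details; as a rough illustration of the final stage it describes (a fine-tuned BERT misspelling detector), the sketch below fine-tunes a pretrained BERT for binary query classification with Hugging Face Transformers. The checkpoint, toy queries, labels, and hyperparameters are illustrative assumptions, not the authors' domain-augmented setup.

```python
# Hedged sketch: binary misspelling detection by fine-tuning BERT on search
# queries with Hugging Face Transformers. The checkpoint, toy queries, labels,
# and hyperparameters are illustrative assumptions, not the paper's setup.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizerFast, BertForSequenceClassification

class QueryDataset(Dataset):
    def __init__(self, queries, labels, tokenizer):
        self.enc = tokenizer(queries, truncation=True, padding=True,
                             max_length=32, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# Toy Maps-style queries: 1 = misspelled, 0 = correctly spelled (illustrative).
queries = ["starbcks near me", "starbucks near me",
           "gas staton on main st", "gas station on main st"]
labels = [1, 0, 1, 0]

loader = DataLoader(QueryDataset(queries, labels, tokenizer),
                    batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
```

In the paper's setting, the training queries would come from its domain-augmented data generation pipeline rather than hand-written examples.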
Related papers
- Ontology Enhanced Claim Detection [1.0878040851637998]
We propose an ontology enhanced model for sentence based claim detection.
We fuse knowledge-base information with BERT sentence embeddings to perform claim detection on the ClaimBuster and NewsClaims datasets.
Our approach showed the best results on these small, unbalanced datasets compared to other statistical and neural machine learning models.
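The summary does not specify how the knowledge-base information and BERT embeddings are fused; one common realization is to concatenate the two vectors before a classification head, as in the hedged sketch below. The placeholder ontology feature vector and its dimension are assumptions, not the paper's architecture.

```python
# Hedged sketch of embedding fusion for claim detection: concatenate a BERT
# [CLS] sentence embedding with a placeholder ontology/KB feature vector and
# classify. The fusion style and dimensions are assumptions, not the paper's.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class FusedClaimClassifier(nn.Module):
    def __init__(self, kb_dim=50, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size + kb_dim,
                                    num_labels)

    def forward(self, input_ids, attention_mask, kb_features):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]           # [CLS] sentence embedding
        fused = torch.cat([cls, kb_features], dim=-1)
        return self.classifier(fused)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = FusedClaimClassifier()
enc = tokenizer(["The unemployment rate fell to 3.5 percent last year."],
                return_tensors="pt")
kb = torch.zeros(1, 50)  # stand-in for ontology-derived features
logits = model(enc["input_ids"], enc["attention_mask"], kb)
```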
arXiv Detail & Related papers (2024-02-19T16:50:58Z)
- Pretraining Without Attention [114.99187017618408]
This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs).
BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
arXiv Detail & Related papers (2022-12-20T18:50:08Z)
- Sparse*BERT: Sparse Models Generalize To New Tasks and Domains [79.42527716035879]
This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks.
We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text.
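As a hedged illustration of unstructured magnitude pruning (not the paper's exact gradual schedule), the sketch below prunes the linear layers of a BERT encoder in a few steps with torch.nn.utils.prune; in a real run, training would continue between pruning steps.

```python
# Hedged sketch: unstructured magnitude pruning of BERT linear layers with
# torch.nn.utils.prune. The step amounts are assumptions; the paper's gradual
# schedule and its biomedical continued pretraining are not reproduced here.
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Prune in a few steps so sparsity grows gradually rather than all at once.
for step_amount in (0.5, 0.3, 0.2):     # each step prunes a fraction of the remaining weights
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=step_amount)
    # ...in a real run, fine-tune / continue pretraining between steps...

# Make the pruning permanent by removing the reparameterization hooks.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```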
arXiv Detail & Related papers (2022-05-25T02:51:12Z)
- Diagnosing BERT with Retrieval Heuristics [8.299945169799793]
"vanilla BERT" has been shown to outperform existing retrieval algorithms by a wide margin.
In this paper, we employ the recently proposed axiomatic dataset analysis technique.
We find BERT, when applied to a recently released large-scale web corpus with ad-hoc topics, to not adhere to any of the explored axioms.
arXiv Detail & Related papers (2022-01-12T13:11:17Z)
- BERTMap: A BERT-based Ontology Alignment System [24.684912604644865]
BERTMap can support both unsupervised and semi-supervised settings.
BERTMap can often perform better than the leading systems LogMap and AML.
arXiv Detail & Related papers (2021-12-05T21:08:44Z)
- Finding the Winning Ticket of BERT for Binary Text Classification via
Adaptive Layer Truncation before Fine-tuning [7.797987384189306]
We construct a series of BERT-based models with different size and compare their predictions on 8 binary classification tasks.
The results show there truly exist smaller sub-networks performing better than the full model.
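A minimal sketch of layer truncation before fine-tuning is shown below: it keeps only the first k encoder layers of a pretrained BERT. The fixed k = 6 is an illustrative assumption; the paper selects the truncation point adaptively.

```python
# Hedged sketch: truncate a pretrained BERT to its first k encoder layers
# before fine-tuning, a simple stand-in for the paper's adaptive truncation.
# k = 6 is an arbitrary illustrative choice.
import torch.nn as nn
from transformers import BertForSequenceClassification

k = 6
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)
# Keep only the bottom k transformer layers.
model.bert.encoder.layer = nn.ModuleList(model.bert.encoder.layer[:k])
model.config.num_hidden_layers = k

# The truncated sub-network fine-tunes exactly like the full model,
# with fewer parameters and faster training.
print(sum(p.numel() for p in model.parameters()))
```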
arXiv Detail & Related papers (2021-11-22T02:22:47Z)
- AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z)
- Enhancing the Generalization for Intent Classification and Out-of-Domain
Detection in SLU [70.44344060176952]
Intent classification is a major task in spoken language understanding (SLU).
Recent works have shown that using extra data and labels can improve the OOD detection performance.
This paper proposes to train a model with only IND data while supporting both IND intent classification and OOD detection.
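The summary does not state the paper's detection mechanism; a classic baseline that supports OOD detection with a model trained only on IND data is thresholding the maximum softmax probability, sketched below. The threshold value and the classifier are assumptions, not the paper's method.

```python
# Hedged sketch: OOD detection on top of an IND-only intent classifier by
# thresholding the maximum softmax probability. The threshold (0.7) is an
# illustrative assumption.
import torch
import torch.nn.functional as F

def classify_with_ood(logits: torch.Tensor, threshold: float = 0.7):
    """Return (intent_id, is_ood) per example from classifier logits."""
    probs = F.softmax(logits, dim=-1)
    confidence, intent = probs.max(dim=-1)
    is_ood = confidence < threshold          # low confidence -> flag as OOD
    return intent, is_ood

# Example: 3 utterances scored over 4 in-domain intents.
logits = torch.tensor([[4.0, 0.1, 0.2, 0.1],    # confident IND prediction
                       [1.0, 0.9, 1.1, 1.0],    # flat distribution -> OOD
                       [0.2, 3.5, 0.1, 0.3]])
print(classify_with_ood(logits))
```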
arXiv Detail & Related papers (2021-06-28T08:27:38Z)
- Can BERT Dig It? -- Named Entity Recognition for Information Retrieval
in the Archaeology Domain [3.928604516640069]
ArcheoBERTje is a BERT model pre-trained on Dutch archaeological texts.
We analyse the differences between the vocabulary and output of the BERT models on the full collection.
arXiv Detail & Related papers (2021-06-14T20:26:19Z)
- An Interpretable End-to-end Fine-tuning Approach for Long Clinical Text [72.62848911347466]
Unstructured clinical text in EHRs contains crucial information for applications including decision support, trial matching, and retrospective research.
Recent work has applied BERT-based models to clinical information extraction and text classification, given these models' state-of-the-art performance in other NLP domains.
In this work, we propose a novel fine-tuning approach called SnipBERT. Instead of using entire notes, SnipBERT identifies crucial snippets and feeds them into a truncated BERT-based model in a hierarchical manner.
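SnipBERT's snippet identification is not detailed in this summary; the sketch below only illustrates the general pattern of selecting a few high-scoring snippets from a long note, encoding each with BERT, and pooling the snippet embeddings. The keyword-overlap scoring and mean pooling are assumptions, not SnipBERT's actual method.

```python
# Hedged sketch of a snippet-then-encode pattern for long clinical notes:
# score sentences by simple keyword overlap, encode the top-k with BERT,
# and mean-pool the snippet embeddings. All scoring/pooling choices are
# assumptions; SnipBERT's snippet identification is not reproduced here.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

def select_snippets(note: str, keywords, k: int = 2):
    sentences = [s.strip() for s in note.split(".") if s.strip()]
    scored = sorted(sentences,
                    key=lambda s: sum(w in s.lower() for w in keywords),
                    reverse=True)
    return scored[:k]

def encode_note(note: str, keywords):
    snippets = select_snippets(note, keywords)
    enc = tokenizer(snippets, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**enc).last_hidden_state[:, 0]   # one vector per snippet
    return cls.mean(dim=0)                             # pooled note embedding

note = ("Patient admitted with chest pain. Denies fever. "
        "History of hypertension. Discharged in stable condition.")
vec = encode_note(note, keywords=["chest", "hypertension"])
```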
arXiv Detail & Related papers (2020-11-12T17:14:32Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
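As a rough illustration of pretraining from scratch on domain text (rather than continuing from a general-domain checkpoint), the sketch below randomly initializes a small BERT-style masked language model and trains it on a toy domain corpus. The configuration sizes, the corpus, and the reuse of a general WordPiece vocabulary are simplifying assumptions, not the paper's recipe.

```python
# Hedged sketch: pretrain a small BERT-style masked language model from
# scratch (random init) on domain text. Config sizes, corpus, and the reuse
# of a general WordPiece vocabulary are simplifying assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=256,
                    num_hidden_layers=4, num_attention_heads=4,
                    intermediate_size=1024)
model = BertForMaskedLM(config)     # random init: no general-domain checkpoint

# Toy stand-in for an unlabeled domain corpus.
corpus = ["The patient was prescribed metformin for type 2 diabetes.",
          "EGFR mutations predict response to tyrosine kinase inhibitors."]
encoded = tokenizer(corpus, truncation=True, max_length=64)
examples = [{"input_ids": ids} for ids in encoded["input_ids"]]
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
loader = DataLoader(examples, batch_size=2, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for batch in loader:                # one tiny pass for illustration
    optimizer.zero_grad()
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
```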