Named Entity Extraction with Finite State Transducers
- URL: http://arxiv.org/abs/2006.11548v1
- Date: Sat, 20 Jun 2020 11:09:04 GMT
- Title: Named Entity Extraction with Finite State Transducers
- Authors: Diego Alexander Huérfano Villalba and Elizabeth León Guzmán
- Abstract summary: We describe a named entity tagging system that requires minimal linguistic knowledge.
The system is based on the ideas of Brill's tagger, which keeps it simple.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe a named entity tagging system that requires minimal linguistic
knowledge and can be applied to more target languages without substantial
changes. The system is based on the ideas of Brill's tagger, which keeps it
simple. Using supervised machine learning, we construct a series of
automata (or transducers) in order to tag a given text. The final model is
composed entirely of automata and requires linear time for tagging. It
was tested on the Spanish dataset provided in the CoNLL-2002 shared task, attaining an
overall $F_{\beta = 1}$ measure of $60\%.$ Also, we present an algorithm for
the construction of the final transducer used to encode all the learned
contextual rules.
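The tagging pipeline the abstract describes, a baseline tag per word followed by learned contextual rewrite rules applied in order, can be sketched as follows. This is a minimal Brill-style illustration, not the paper's actual transducer implementation; the `BASELINE` lexicon and `RULES` are invented for the example. Each rule fires in one left-to-right pass, so tagging is linear in the length of the text.

```python
# Baseline lexicon: the most frequent tag seen for each word in training
# (hypothetical entries; unknown words default to "O").
BASELINE = {"Nueva": "B-LOC", "Juan": "B-PER", "vive": "O", "en": "O"}

# Contextual rewrite rules: (from_tag, to_tag, predicate on previous tag).
# In the paper these rules are encoded as finite state transducers.
RULES = [
    ("O", "I-LOC", lambda prev: prev in ("B-LOC", "I-LOC")),
]

def tag(tokens):
    # Step 1: assign baseline tags.
    tags = [BASELINE.get(t, "O") for t in tokens]
    # Step 2: apply each learned rule in order, one pass each.
    for from_tag, to_tag, cond in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and cond(tags[i - 1]):
                tags[i] = to_tag
    return tags
```

Composing all the rules into a single transducer, as the paper's final algorithm does, preserves this linear-time behavior while avoiding one pass per rule.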
Related papers
- Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens [138.36729703589512]
We show that $n$-gram language models are still relevant in this era of neural large language models (LLMs).
This was done by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens.
Second, existing $n$-gram LMs use small $n$, which hinders their performance; we instead allow $n$ to be arbitrarily large by introducing a new $\infty$-gram LM with backoff.
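The backoff idea in this summary can be sketched as follows: predict the next token from the longest suffix of the context that occurs in the training data, backing off one token at a time. This is a naive linear-scan illustration with invented names; the actual system uses suffix-array indexing to make such queries fast at trillion-token scale.

```python
from collections import Counter

def infgram_next(corpus, context):
    """Return the most frequent next token after the longest matching
    suffix of `context` found in `corpus` (unbounded n with backoff)."""
    for k in range(len(context), -1, -1):
        suffix = tuple(context[len(context) - k:])
        # Count tokens that follow each occurrence of this suffix.
        counts = Counter(
            corpus[i + k]
            for i in range(len(corpus) - k)
            if tuple(corpus[i:i + k]) == suffix
        )
        if counts:
            return counts.most_common(1)[0][0]
    return None  # empty corpus
```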
arXiv Detail & Related papers (2024-01-30T19:03:49Z)
- Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math [52.66190891388847]
We introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens.
Our meticulous data collection and processing efforts included a complex suite of preprocessing.
We hope our MathPile can help to enhance the mathematical reasoning abilities of language models.
arXiv Detail & Related papers (2023-12-28T16:55:40Z)
- Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines [8.405938712823565]
Parallelism is the juxtaposition of phrases that have the same sequence of linguistic features.
Despite the ubiquity of parallelism, the field of natural language processing has seldom investigated it.
We construct a formal definition of it; we provide one new Latin dataset and one adapted Chinese dataset for it; we establish a family of metrics to evaluate performance on it.
arXiv Detail & Related papers (2023-11-30T15:24:57Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- On the Intersection of Context-Free and Regular Languages [71.61206349427509]
We generalize the Bar-Hillel construction to handle finite-state automata with $\varepsilon$-arcs.
We prove that our construction leads to a grammar that encodes the structure of both the input automaton and grammar while retaining the size of the original construction.
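For intuition, the classic $\varepsilon$-free Bar-Hillel construction that this paper generalizes can be sketched as follows: intersect a CFG (assumed binarized here) with an FSA by indexing every nonterminal with a pair of automaton states. All names are illustrative, and this sketch does not include the paper's $\varepsilon$-arc handling or size-preservation refinements.

```python
def bar_hillel(rules, terminals, states, arcs, start, final, S):
    """Classic Bar-Hillel intersection of a binarized CFG with an FSA.
    rules:     binary CFG rules as (X, Y, Z) meaning X -> Y Z
    terminals: lexical rules as (X, a) meaning X -> a
    arcs:      FSA transitions as (q, a, r)
    Nonterminals of the result are (state, nonterminal, state) triples."""
    # X -> Y Z becomes (p, X, r) -> (p, Y, q) (q, Z, r) for all p, q, r.
    new_rules = [((p, X, r), (p, Y, q), (q, Z, r))
                 for (X, Y, Z) in rules
                 for p in states for q in states for r in states]
    # X -> a becomes (q, X, r) -> a whenever the FSA has an arc q -a-> r.
    new_terminals = [((q, X, r), a)
                     for (X, a) in terminals
                     for (q, b, r) in arcs if a == b]
    # The new start symbols span from the FSA start state to a final state.
    new_starts = [(start, S, f) for f in final]
    return new_rules, new_terminals, new_starts
```

Note the cubic blow-up in the number of binary rules (one copy per state triple); keeping the result close to the size of the original construction is exactly the property the paper proves for its generalization.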
arXiv Detail & Related papers (2022-09-14T17:49:06Z)
- Automatic question generation based on sentence structure analysis using machine learning approach [0.0]
This article introduces our framework for generating factual questions from unstructured text in the English language.
It uses a combination of traditional linguistic approaches based on sentence patterns with several machine learning methods.
The framework also includes a question evaluation module which estimates the quality of generated questions.
arXiv Detail & Related papers (2022-05-25T14:35:29Z)
- A Case Study of Spanish Text Transformations for Twitter Sentiment Analysis [1.9694608733361543]
Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., its positiveness or negativeness.
New forms of textual expression present new challenges for analyzing text, given the use of slang and orthographic and grammatical errors.
arXiv Detail & Related papers (2021-06-03T17:24:31Z)
- Breaking Writer's Block: Low-cost Fine-tuning of Natural Language Generation Models [62.997667081978825]
We describe a system that fine-tunes a natural language generation model for the problem of solving Writer's Block.
The proposed fine-tuning obtains excellent results, even with a small number of epochs and a total cost of USD 150.
arXiv Detail & Related papers (2020-12-19T11:19:11Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations [4.36561468436181]
We present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.
Our approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders.
Our code and pretrained models are publicly available and can be easily adapted to new domains or used to embed unseen text.
arXiv Detail & Related papers (2020-06-05T20:00:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.