An Information Extraction Study: Take In Mind the Tokenization!
- URL: http://arxiv.org/abs/2303.15100v2
- Date: Sat, 1 Apr 2023 19:04:58 GMT
- Title: An Information Extraction Study: Take In Mind the Tokenization!
- Authors: Christos Theodoropoulos, Marie-Francine Moens
- Abstract summary: We study the impact of tokenization when extracting information from documents.
We present a comparative study and analysis of subword-based and character-based models.
The main outcome is twofold: tokenization patterns can introduce inductive bias that results in state-of-the-art performance, and character-based models produce promising results.
- Score: 18.20319269401045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current research on the advantages and trade-offs of using characters,
instead of tokenized text, as input for deep learning models, has evolved
substantially. New token-free models remove the traditional tokenization step;
however, their efficiency remains unclear. Moreover, the effect of tokenization
is relatively unexplored in sequence tagging tasks. To this end, we investigate
the impact of tokenization when extracting information from documents and
present a comparative study and analysis of subword-based and character-based
models. Specifically, we study Information Extraction (IE) from biomedical
texts. The main outcome is twofold: tokenization patterns can introduce
inductive bias that results in state-of-the-art performance, and the
character-based models produce promising results; thus, transitioning to
token-free IE models is feasible.
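To make the subword-vs-character contrast at the heart of the study concrete, the sketch below compares plain character tokenization with a greedy longest-match subword segmentation (a simplified, WordPiece-style scheme). The vocabulary and function names are illustrative assumptions, not the paper's actual tokenizers or models.

```python
# Toy illustration: character-level vs. subword tokenization of a
# biomedical term. SUBWORD_VOCAB is a hypothetical vocabulary chosen
# for the example; real subword vocabularies are learned from corpora.

SUBWORD_VOCAB = {"amino", "trans", "fer", "ase", "acid"}

def char_tokenize(text: str) -> list[str]:
    """Character-based input: every character is its own token."""
    return list(text)

def subword_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword segmentation (simplified)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, shrinking toward one char.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary match: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(char_tokenize("aminotransferase"))
# ['a', 'm', 'i', 'n', 'o', 't', 'r', 'a', 'n', 's', 'f', 'e', 'r', 'a', 's', 'e']
print(subword_tokenize("aminotransferase", SUBWORD_VOCAB))
# ['amino', 'trans', 'fer', 'ase']
```

The two segmentations of the same word show how the choice of tokenizer changes the units the model sees, which is precisely the inductive bias the study examines.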
Related papers
- TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior [30.782240245074433]
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). TokSuite is a collection of models and a benchmark that supports research into tokenization's influence on LMs.
arXiv Detail & Related papers (2025-12-23T20:43:06Z)
- Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning [53.398270878295754]
Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs). We suggest categorizing tokens within each corpus into two parts, positive and negative tokens, based on whether they are useful for improving model performance. We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
arXiv Detail & Related papers (2025-08-06T11:22:23Z)
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
- Explaining the Unexplained: Revealing Hidden Correlations for Better Interpretability [1.8274323268621635]
Real Explainer (RealExp) is an interpretability method that decouples the Shapley Value into individual feature importance and feature correlation importance.
RealExp enhances interpretability by precisely quantifying both individual feature contributions and their interactions.
arXiv Detail & Related papers (2024-12-02T10:50:50Z)
- AutoElicit: Using Large Language Models for Expert Prior Elicitation in Predictive Modelling [53.54623137152208]
We introduce AutoElicit to extract knowledge from large language models and construct priors for predictive models.
We show these priors are informative and can be refined using natural language.
We find that AutoElicit yields priors that can substantially reduce error over uninformative priors, using fewer labels, and consistently outperform in-context learning.
arXiv Detail & Related papers (2024-11-26T10:13:39Z)
- Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights [0.412484724941528]
We introduce a simple yet effective knowledge distillation method to improve the performance of small language models.
Our approach utilizes a teacher model with approximately 3 billion parameters to identify the most influential tokens in its decision-making process.
This method has proven to be effective, as demonstrated by testing it on four diverse datasets.
arXiv Detail & Related papers (2024-09-19T09:09:53Z)
- Common Steps in Machine Learning Might Hinder The Explainability Aims in Medicine [0.0]
This paper discusses the data preprocessing steps in machine learning and their impact on the explainability and interpretability of the model.
It finds that these steps improve the accuracy of the model but might hinder its explainability if they are not carefully considered, especially in medicine.
arXiv Detail & Related papers (2024-08-30T12:09:14Z)
- Topic Modelling: Going Beyond Token Outputs [3.072340427031969]
This paper presents a novel approach towards extending the output of traditional topic modelling methods beyond a list of isolated tokens.
To measure the interpretability of the proposed outputs against those of the traditional topic modelling approach, independent annotators manually scored each output.
arXiv Detail & Related papers (2024-01-16T16:05:54Z)
- Improving Input-label Mapping with Demonstration Replay for In-context Learning [67.57288926736923]
In-context learning (ICL) is an emerging capability of large autoregressive language models.
We propose a novel ICL method called Sliding Causal Attention (RdSca).
We show that our method significantly improves the input-label mapping in ICL demonstrations.
arXiv Detail & Related papers (2023-10-30T14:29:41Z)
- Explaining Explainability: Towards Deeper Actionable Insights into Deep Learning through Second-order Explainability [70.60433013657693]
Second-order explainable AI (SOXAI) was recently proposed to extend explainable AI (XAI) from the instance level to the dataset level.
We demonstrate for the first time, via example classification and segmentation cases, that eliminating irrelevant concepts from the training set based on actionable insights from SOXAI can enhance a model's performance.
arXiv Detail & Related papers (2023-06-14T23:24:01Z)
- Metric Tools for Sensitivity Analysis with Applications to Neural Networks [0.0]
Explainable Artificial Intelligence (XAI) aims to provide interpretations for predictions made by Machine Learning models.
In this paper, a theoretical framework is proposed to study sensitivities of ML models using metric techniques.
A complete family of new quantitative metrics called $\alpha$-curves is extracted.
arXiv Detail & Related papers (2023-05-03T18:10:21Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks? [59.51457877578138]
We study contrastive learning on the wearable-based activity recognition task.
This paper presents an open-source PyTorch library, CL-HAR, which can serve as a practical tool for researchers.
arXiv Detail & Related papers (2022-02-12T06:10:15Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite being trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect samples causing oversensitivity and overstability with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Understanding Neural Abstractive Summarization Models via Uncertainty [54.37665950633147]
seq2seq abstractive summarization models generate text in a free-form manner.
We study the entropy, or uncertainty, of the model's token-level predictions.
We show that uncertainty is a useful perspective for analyzing summarization and text generation models more broadly.
arXiv Detail & Related papers (2020-10-15T16:57:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.