Tokenization Preference for Human and Machine Learning Model: An
Annotation Study
- URL: http://arxiv.org/abs/2304.10813v3
- Date: Fri, 16 Feb 2024 07:55:59 GMT
- Title: Tokenization Preference for Human and Machine Learning Model: An
Annotation Study
- Authors: Tatsuya Hiraoka, Tomoya Iwakura
- Abstract summary: This study examines the relation between the tokenization preferred by humans and that preferred by machine-learning (ML) models.
We analyze the relations among the answer performance of humans and ML models, the appropriateness of a tokenization for humans, and human response time to questions.
- Score: 6.399914034380356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Is the tokenization preferred by humans also preferred by
machine-learning (ML) models? This study examines the relations between the
tokenization preferred by humans (appropriateness and readability) and that
preferred by ML models (performance on an NLP task). The question texts of a
Japanese commonsense question-answering dataset were tokenized with six
different tokenizers, and the performance of human annotators and ML models on
the resulting variants was compared. Furthermore, we analyze the relations
among the answer performance of humans and ML models, the appropriateness of a
tokenization for humans, and human response time to questions. This study
provides quantitative evidence that the tokenizations preferred by humans and
ML models are not necessarily the same. The results also imply that existing
methods that use language models for tokenization could be a good compromise
for both humans and ML models.
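As a rough illustration of the experimental setup, the sketch below tokenizes the same Japanese question with two common tokenizer families (a subword model and a morphological analyzer). The libraries, model file, and example sentence are illustrative assumptions; the paper's six tokenizers and its QA model are not specified here.

```python
# A minimal sketch of the comparison setup: tokenize the same Japanese
# question with different tokenizers, then compare how humans and models
# handle each variant. Assumes `sentencepiece` and `fugashi` (a MeCab
# wrapper, with a dictionary such as unidic-lite) are installed and that
# a trained SentencePiece model file exists; these are stand-ins, not
# the paper's actual six tokenizers.
import sentencepiece as spm
from fugashi import Tagger

question = "電車に乗るには何が必要ですか"  # illustrative question text

# Subword tokenization with a (hypothetical) trained SentencePiece model.
sp = spm.SentencePieceProcessor(model_file="ja_subword.model")
subword_tokens = sp.encode(question, out_type=str)

# Morphological (word-level) tokenization with MeCab via fugashi.
tagger = Tagger()
word_tokens = [token.surface for token in tagger(question)]

# Each tokenizer yields a differently segmented question; the study
# compares human appropriateness/readability judgments and response
# times against ML task performance across such variants.
for name, tokens in [("sentencepiece", subword_tokens), ("mecab", word_tokens)]:
    print(name, " / ".join(tokens))
```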
Related papers
- Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility [7.183662547358301]
We examine whether large language models process language similarly to humans.
We find that some LLMs do quantitatively and qualitatively reflect human-like asymmetries between production and interpretation.
arXiv Detail & Related papers (2025-03-21T23:25:42Z)
- The Multiple Dimensions of Spuriousness in Machine Learning [3.475875199871536]
Learning correlations from data forms the foundation of today's machine learning (ML) and artificial intelligence (AI) research.
While such an approach enables the automatic discovery of patterned relationships within big data corpora, it is susceptible to failure modes when unintended correlations are captured.
This vulnerability has expanded interest in interrogating spuriousness, often critiqued as an impediment to model performance, fairness, and robustness.
arXiv Detail & Related papers (2024-11-07T13:29:32Z)
- Reverse-Engineering the Reader [43.26660964074272]
We introduce a novel alignment technique in which we fine-tune a language model to implicitly optimize the parameters of a linear regressor.
Using words as a test case, we evaluate our technique across multiple model sizes and datasets.
We find an inverse relationship between psychometric power and a model's performance on downstream NLP tasks as well as its perplexity on held-out test data.
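As a rough sketch of the psychometric setup this summary presupposes, the snippet below fits a linear regressor of per-word reading times on per-word LM surprisal; the paper's contribution is to fine-tune the LM so that such a regressor fits better. All numbers are hypothetical.

```python
# A minimal sketch of the inner psychometric step: regress human reading
# times on per-word LM surprisal. The paper goes further and fine-tunes
# the LM end-to-end around this regression; the values below are purely
# illustrative.
import numpy as np

surprisal = np.array([2.1, 5.4, 1.3, 7.8, 3.2])          # -log p(word|context), hypothetical
reading_time = np.array([180., 260., 150., 330., 210.])  # ms per word, hypothetical

# Closed-form least squares for reading_time ~ a * surprisal + b.
X = np.column_stack([surprisal, np.ones_like(surprisal)])
(a, b), *_ = np.linalg.lstsq(X, reading_time, rcond=None)

residuals = reading_time - X @ np.array([a, b])
print(f"slope={a:.1f} ms/nat, intercept={b:.1f} ms, "
      f"MSE={np.mean(residuals**2):.1f}")
```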
arXiv Detail & Related papers (2024-10-16T23:05:01Z)
- A Probability--Quality Trade-off in Aligned Language Models and its Relation to Sampling Adaptors [50.046717886067555]
We show that when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood.
We provide a formal treatment of this phenomenon and demonstrate how a choice of sampling adaptor allows for a selection of how much likelihood we exchange for the reward.
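A sampling adaptor is a transform applied to the model's next-token distribution at decoding time; temperature scaling is one simple instance. The sketch below, with hypothetical logits, shows the kind of knob the paper analyzes: lower temperature shifts probability mass toward high-likelihood tokens.

```python
# A minimal sketch of a sampling adaptor: a temperature transform of the
# next-token distribution. Lower temperatures concentrate mass on
# high-probability tokens (raising average log-likelihood); the paper's
# point is that such a knob also moves the average reward. The logits
# here are hypothetical.
import numpy as np

def temperature_adaptor(logits: np.ndarray, tau: float) -> np.ndarray:
    """Return the adapted next-token distribution softmax(logits / tau)."""
    z = logits / tau
    z -= z.max()          # numerical stability before exponentiation
    p = np.exp(z)
    return p / p.sum()

logits = np.array([3.0, 1.5, 0.2, -1.0])  # hypothetical next-token logits
for tau in (0.5, 1.0, 2.0):
    print(tau, np.round(temperature_adaptor(logits, tau), 3))
```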
arXiv Detail & Related papers (2024-06-14T17:38:21Z)
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z)
- Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice [4.029252551781513]
We propose a novel way to enhance the utility of Large Language Models as cognitive models.
We show that an LLM pretrained on an ecologically valid arithmetic dataset predicts human behavior better than many traditional cognitive models.
arXiv Detail & Related papers (2024-05-29T17:37:14Z)
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
- Longer Fixations, More Computation: Gaze-Guided Recurrent Neural Networks [12.57650361978445]
Humans read texts at a varying pace, while machine learning models treat each token in the same way.
In this paper, we convert this intuition into a set of novel models with fixation-guided parallel RNNs or layers.
We find that, interestingly, the fixation durations predicted by neural networks bear some resemblance to human fixations.
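A toy reading of the idea, under the assumption that fixation duration maps to amount of computation: tokens fixated longer receive more recurrent update steps. The tiny cell and the fixation values below are hypothetical; the paper's fixation-guided parallel RNNs are more elaborate.

```python
# A minimal toy of the fixation-guided idea as read from the summary:
# each token gets a number of recurrent micro-steps proportional to its
# (predicted) fixation duration. Weights, embeddings, and fixation
# counts are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(8, 8))  # hidden-to-hidden weights
W_x = rng.normal(scale=0.1, size=(8, 4))  # input-to-hidden weights

tokens = rng.normal(size=(5, 4))          # 5 token embeddings, dim 4
fixations = [1, 3, 1, 2, 4]               # micro-steps per token ~ fixation length

h = np.zeros(8)
for x, steps in zip(tokens, fixations):
    for _ in range(steps):                # longer fixation -> more computation
        h = np.tanh(W_h @ h + W_x @ x)
print(np.round(h, 3))
```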
arXiv Detail & Related papers (2023-10-31T21:32:11Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies [80.82897149158853]
Autoregressive language models are trained by minimizing the cross-entropy of the model distribution Q relative to the data distribution P.
We propose learning with MixCE, an objective that mixes the forward and reverse cross-entropies.
We show that the resulting models yield better generated text without complex decoding strategies.
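On a toy discrete distribution the mixed objective can be written out directly, as below. Note the paper itself works with sequence models and has to approximate the reverse term, since the data distribution P is only observed through samples; the distributions and mixing weight here are hypothetical.

```python
# A toy illustration of mixing forward and reverse cross-entropies:
# MixCE-style objectives combine H(P, Q) with H(Q, P). P, Q, and eta
# are hypothetical; the paper approximates the reverse term for
# sequence models rather than computing it exactly.
import numpy as np

def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """H(p, q) = -sum_x p(x) log q(x)."""
    return float(-np.sum(p * np.log(q)))

P = np.array([0.7, 0.2, 0.1])  # "data" distribution (hypothetical)
Q = np.array([0.5, 0.3, 0.2])  # model distribution (hypothetical)
eta = 0.5                      # mixing weight

mixed = eta * cross_entropy(P, Q) + (1 - eta) * cross_entropy(Q, P)
print(f"forward={cross_entropy(P, Q):.3f}  "
      f"reverse={cross_entropy(Q, P):.3f}  mixed={mixed:.3f}")
```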
arXiv Detail & Related papers (2023-05-26T14:14:51Z)
- Quantifying Human Bias and Knowledge to guide ML models during Training [0.0]
We introduce an experimental approach to dealing with skewed datasets by including humans in the training process.
We ask humans to rank the importance of features of the dataset, and through rank aggregation, determine the initial weight bias for the model.
We show that collective human bias can allow ML models to learn insights about the true population instead of the biased sample.
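A minimal sketch of the rank-aggregation step, assuming a Borda-style count (the paper's exact aggregation scheme may differ): annotator rankings are converted into points and normalized into initial feature weights. The features and rankings below are hypothetical.

```python
# A minimal sketch of rank aggregation for initial feature weights:
# each annotator ranks features from most to least important; a Borda
# count aggregates the rankings, and the scores are normalized into
# weights. All names and rankings are hypothetical.
from collections import defaultdict

features = ["age", "income", "zip_code", "education"]
rankings = [  # hypothetical annotator rankings, most important first
    ["income", "education", "age", "zip_code"],
    ["education", "income", "zip_code", "age"],
    ["income", "age", "education", "zip_code"],
]

scores = defaultdict(float)
for ranking in rankings:
    for position, feature in enumerate(ranking):
        scores[feature] += len(features) - 1 - position  # Borda points

total = sum(scores.values())
initial_weights = {f: scores[f] / total for f in features}
print(initial_weights)
```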
arXiv Detail & Related papers (2022-11-19T20:49:07Z)
- To what extent do human explanations of model behavior align with actual model behavior? [91.67905128825402]
We investigated the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions.
We defined two alignment metrics that quantify how well natural language human explanations align with model sensitivity to input words.
We find that a model's alignment with human explanations is not predicted by the model's accuracy on NLI.
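One simple alignment metric in this spirit (not necessarily either of the paper's two) correlates a binary mask over the words cited in human explanations with a per-word model-sensitivity score; the vectors below are hypothetical.

```python
# A minimal sketch of an explanation-alignment metric: correlate a
# binary human-explanation mask over input words with a model
# word-sensitivity score (e.g. prediction change when the word is
# deleted). Both vectors are hypothetical placeholders.
import numpy as np

human_mask = np.array([0, 1, 0, 0, 1, 1], dtype=float)          # words cited by humans
model_sensitivity = np.array([0.1, 0.8, 0.05, 0.2, 0.6, 0.7])   # per-word sensitivity

# Pearson correlation as a simple alignment score.
alignment = np.corrcoef(human_mask, model_sensitivity)[0, 1]
print(f"alignment={alignment:.2f}")
```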
arXiv Detail & Related papers (2020-12-24T17:40:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.