Related papers: CUTE: Measuring LLMs' Understanding of Their Tokens

URL: http://arxiv.org/abs/2409.15452v2
Date: Wed, 2 Oct 2024 14:35:40 GMT
Title: CUTE: Measuring LLMs' Understanding of Their Tokens
Authors: Lukas Edman, Helmut Schmid, Alexander Fraser
Abstract summary: Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. This raises the question: To what extent can LLMs learn orthographic information? We propose a new benchmark, which features a collection of tasks designed to test the orthographic knowledge of LLMs.
Score: 54.70665106141121
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.

Related papers

Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters [25.430820735194768]
Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks. We investigate how LLMs internally represent and utilize character-level information during the spelling-out process.
arXiv Detail & Related papers (2025-06-12T12:27:41Z)
EXECUTE: A Multilingual Benchmark for LLM Token Understanding [54.70665106141121]
Tests across multiple languages reveal that challenges in other languages are not always on the character level as in English. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs' understanding of character components.
arXiv Detail & Related papers (2025-05-23T11:56:48Z)
Evaluating LLMs for Visualization Tasks [0.0]
We showcase the capabilities of different popular Large Language Models (LLMs) to generate code for visualization based on simple prompts. We analyze the power of LLMs to understand some common visualizations by answering simple questions.
arXiv Detail & Related papers (2025-04-10T10:12:30Z)
Scoring with Large Language Models: A Study on Measuring Empathy of Responses in Dialogues [3.2162648244439684]
We develop a framework for investigating how effective Large Language Models are at measuring and scoring empathy of responses in dialogues. Our strategy is to approximate the performance of state-of-the-art and fine-tuned LLMs with explicit and explainable features. Our results show that when only using embeddings, it is possible to achieve performance close to that of generic LLMs.
arXiv Detail & Related papers (2024-12-28T20:37:57Z)
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
On Unsupervised Prompt Learning for Classification with Black-box Language Models [71.60563181678323]
Large language models (LLMs) have achieved impressive success in text-formatted learning problems. LLMs can label datasets with even better quality than skilled human annotators. In this paper, we propose unsupervised prompt learning for classification with black-box LLMs.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
LLMs' Understanding of Natural Language Revealed [0.0]
Large language models (LLMs) are the result of a massive experiment in bottom-up, data-driven reverse engineering of language at scale. We will focus on testing LLMs for their language understanding capabilities, their supposed forte.
arXiv Detail & Related papers (2024-07-29T01:21:11Z)
Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens [21.61634020256455]
Transformer-based large language models (LLMs) suffer a performance degradation when modeling long-term contexts. We propose a simple yet effective method to enable LLMs to take a deep breath, encouraging them to summarize information contained within discrete text chunks.
arXiv Detail & Related papers (2024-06-16T15:50:10Z)
Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach [0.0]
Large Language Models (LLMs) produce inaccurate outputs, also known as hallucinations. This paper introduces a supervised learning approach employing only four numerical features derived from tokens and vocabulary probabilities obtained from other evaluators. The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks.
arXiv Detail & Related papers (2024-05-30T03:00:47Z)
When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp. Specifically, the cunning texts that FLUB focuses on mainly consist of the tricky, humorous, and misleading texts collected from the real internet environment. Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z)
Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
Enabling Large Language Models to Learn from Rules [99.16680531261987]
We are inspired by the fact that humans can also learn new tasks or knowledge in another way: by learning from rules. We propose rule distillation, which first uses the strong in-context abilities of LLMs to extract the knowledge from the textual rules. Our experiments show that making LLMs learn from rules by our method is much more efficient than example-based learning in both the sample size and generalization ability.
arXiv Detail & Related papers (2023-11-15T11:42:41Z)
Pre-training LLMs using human-like development data corpus [3.5757761767474876]
We pre-train and evaluate Large Language Models (LLMs) on their ability to learn contextual word representations using roughly the same number of tokens as seen by children. We provide a strong set of baselines with different architectures, an evaluation of changes in performance across epochs, and reported pre-training metrics for the strict small and strict tracks of the task.
arXiv Detail & Related papers (2023-11-08T13:13:23Z)
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks. This paper presents the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.