Understanding Subword Compositionality of Large Language Models
- URL: http://arxiv.org/abs/2508.17953v1
- Date: Mon, 25 Aug 2025 12:16:56 GMT
- Title: Understanding Subword Compositionality of Large Language Models
- Authors: Qiwei Peng, Yekun Chai, Anders Søgaard
- Abstract summary: Large language models (LLMs) take sequences of subwords as input, requiring them to compose subword representations. We present a comprehensive set of experiments to probe how LLMs compose subword information.
- Score: 42.51978887170929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis suggests that the five LLM families we examine can be classified into three distinct groups, likely reflecting differences in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) strong performance when probing, layer by layer, their sensitivity to semantic decomposability; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight differing compositional patterns in how LLMs encode and integrate subword information.
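To make the layer-wise setup concrete, below is a minimal, hypothetical sketch of one such probe: the hidden states of a word's subword tokens are mean-pooled at every layer and compared, via cosine similarity, with the representation of a semantically related word. The choice of GPT-2, mean pooling as the composition function, and the word pair "unhappiness"/"sadness" are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative layer-wise subword-composition probe (assumptions noted above).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # assumption: any HF model that exposes hidden states would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def word_vectors_per_layer(word: str):
    """Mean-pool the hidden states of a word's subword tokens at every layer."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        # hidden_states: (num_layers + 1) tensors of shape (1, seq_len, dim)
        hidden_states = model(**inputs).hidden_states
    return [layer[0].mean(dim=0) for layer in hidden_states]

composed = word_vectors_per_layer("unhappiness")  # typically split into several subwords
reference = word_vectors_per_layer("sadness")     # a semantically related comparison word

for layer_idx, (c, r) in enumerate(zip(composed, reference)):
    sim = F.cosine_similarity(c.unsqueeze(0), r.unsqueeze(0)).item()
    print(f"layer {layer_idx:2d}  cosine similarity = {sim:.3f}")
```

Tracking how such similarity curves evolve across layers, for different composition functions and word pairs, is one plausible way to compare composition strategies across model families.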
Related papers
- Differential syntactic and semantic encoding in LLMs [49.300174325011426]
We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs). We find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can, to some extent, be decoupled.
arXiv Detail & Related papers (2026-01-08T09:33:29Z) - Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection [39.748035737067745]
We propose restructuring linguistic representations according to the hierarchical relations within sentences for language-based object detection. A key insight is the necessity of disentangling textual tokens into core components, namely objects, attributes, and relations ("talk in pieces"), and subsequently aggregating them into hierarchically structured sentence-level representations. Experimental results on the OmniLabel benchmark show a 24% performance improvement, demonstrating the importance of linguistic compositionality.
arXiv Detail & Related papers (2025-09-29T02:14:26Z) - How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding [39.342366994703376]
We introduce a probing framework to analyze how MLLMs process visual and textual inputs across layers. We show that while the overall stage-wise structure remains stable across variations in visual tokenization, instruction tuning data, and pretraining corpus, the specific layer allocation to each stage shifts.
arXiv Detail & Related papers (2025-08-27T21:22:01Z) - DISRetrieval: Harnessing Discourse Structure for Long Document Retrieval [51.89673002051528]
DISRetrieval is a novel hierarchical retrieval framework that leverages linguistic discourse structure to enhance long document understanding. Our studies confirm that discourse structure significantly enhances retrieval effectiveness across different document lengths and query types.
arXiv Detail & Related papers (2025-05-26T14:45:12Z) - Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection [19.610781457283966]
We introduce a novel method for enhancing the compositional understanding of vision-language (VL) models in language-based object detection.
Our framework generates densely paired positive and negative triplets (image, text descriptions, and bounding boxes) in both image and text domains.
We propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets.
arXiv Detail & Related papers (2024-07-21T23:43:24Z) - Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language
Pretraining? [34.609984453754656]
We aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment.
Specifically, we design and release SNARE, the first large-scale multimodal alignment probing benchmark.
arXiv Detail & Related papers (2023-08-24T16:17:40Z) - Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics
Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z) - Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z) - Linear Spaces of Meanings: Compositional Structures in Vision-Language
Models [110.00434385712786]
We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs).
We first present a framework for understanding compositional structures from a geometric perspective.
We then explain what these structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice.
arXiv Detail & Related papers (2023-02-28T08:11:56Z) - Unsupervised Distillation of Syntactic Information from Contextualized
Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z) - LSTMs Compose (and Learn) Bottom-Up [18.34617849764921]
Recent work in NLP shows that LSTM language models capture hierarchical structure in language data.
In contrast to existing work, we consider the learning process that leads to their compositional behavior.
We present a related measure of Decompositional Interdependence between word meanings in an LSTM, based on their gate interactions.
arXiv Detail & Related papers (2020-10-06T13:00:32Z) - Word Interdependence Exposes How LSTMs Compose Representations [18.34617849764921]
Recent work in NLP shows that LSTM language models capture compositional structure in language data.
We present a novel measure of interdependence between word meanings in an LSTM, based on their interactions at the internal gates.
arXiv Detail & Related papers (2020-04-27T21:48:08Z)