Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs
- URL: http://arxiv.org/abs/2505.04637v1
- Date: Sat, 03 May 2025 09:14:24 GMT
- Title: Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs
- Authors: Dongxing Yu,
- Abstract summary: This research presents a systematic investigation into the parallels between human cross-modal chunking mechanisms and token representation methodologies.<n>We propose a novel framework for dynamic cross-modal tokenization that incorporates adaptive boundaries, hierarchical representations, and alignment mechanisms grounded in cognitive science principles.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing diverse data types, yet significant disparities persist between human cognitive processes and computational approaches to multimodal information integration. This research presents a systematic investigation into the parallels between human cross-modal chunking mechanisms and token representation methodologies in MLLMs. Through empirical studies comparing human performance patterns with model behaviors across visual-linguistic tasks, we demonstrate that conventional static tokenization schemes fundamentally constrain current models' capacity to simulate the dynamic, context-sensitive nature of human information processing. We propose a novel framework for dynamic cross-modal tokenization that incorporates adaptive boundaries, hierarchical representations, and alignment mechanisms grounded in cognitive science principles. Quantitative evaluations demonstrate that our approach yields statistically significant improvements over state-of-the-art models on benchmark tasks (+7.8% on Visual Question Answering, +5.3% on Complex Scene Description) while exhibiting more human-aligned error patterns and attention distributions. These findings contribute to the theoretical understanding of the relationship between human cognition and artificial intelligence, while providing empirical evidence for developing more cognitively plausible AI systems.
Related papers
- Large Language Models as Psychological Simulators: A Methodological Guide [0.0]
This article provides a framework for using large language models as psychological simulators across two primary applications.<n>For simulation, we present methods for developing psychologically grounded personas that move beyond demographic categories.<n>We address overarching challenges including prompt sensitivity, temporal limitations from training data cutoffs, and ethical considerations that extend beyond traditional human subjects review.
arXiv Detail & Related papers (2025-06-20T02:45:23Z) - Dynamic Programming Techniques for Enhancing Cognitive Representation in Knowledge Tracing [125.75923987618977]
We propose the Cognitive Representation Dynamic Programming based Knowledge Tracing (CRDP-KT) model.<n>It is a dynamic programming algorithm to optimize cognitive representations based on the difficulty of the questions and the performance intervals between them.<n>It provides more accurate and systematic input features for subsequent model training, thereby minimizing distortion in the simulation of cognitive states.
arXiv Detail & Related papers (2025-06-03T14:44:48Z) - Modèles de Substitution pour les Modèles à base d'Agents : Enjeux, Méthodes et Applications [0.0]
Agent-based models (ABM) are widely used to study emergent phenomena arising from local interactions.<n>The complexity of ABM limits their feasibility for real-time decision-making and large-scale scenario analysis.<n>To address these limitations, surrogate models offer an efficient alternative by learning approximations from sparse simulation data.
arXiv Detail & Related papers (2025-05-17T08:55:33Z) - Artificial Behavior Intelligence: Technology, Challenges, and Future Directions [1.5237607855633524]
This paper defines the technical framework of Artificial Behavior Intelligence (ABI)<n>ABI comprehensively analyzes and interprets human posture, facial expressions, emotions, behavioral sequences, and contextual cues.<n>It details the essential components of ABI, including pose estimation, face and emotion recognition, sequential behavior analysis, and context-aware modeling.
arXiv Detail & Related papers (2025-05-06T08:45:44Z) - On the Reasoning Capacity of AI Models and How to Quantify It [0.0]
Large Language Models (LLMs) have intensified the debate surrounding the fundamental nature of their reasoning capabilities.<n>While achieving high performance on benchmarks such as GPQA and MMLU, these models exhibit limitations in more complex reasoning tasks.<n>We propose a novel phenomenological approach that goes beyond traditional accuracy metrics to probe the underlying mechanisms of model behavior.
arXiv Detail & Related papers (2025-01-23T16:58:18Z) - MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework.<n>MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution.<n>Our approach reaches state-of-the-art (SOTA) performance in nine tasks using significantly less data compared to state-of-the-art models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z) - PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, integrating psychology-grounded principles of personality: social practice, consistency, and dynamic development.
We incorporate personality traits directly into the model parameters, enhancing the model's resistance to induction, promoting consistency, and supporting the dynamic evolution of personality.
arXiv Detail & Related papers (2024-07-17T08:13:22Z) - Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks [50.75902473813379]
This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models.
The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes.
arXiv Detail & Related papers (2024-07-04T14:36:49Z) - Discrete, compositional, and symbolic representations through attractor dynamics [51.20712945239422]
We introduce a novel neural systems model that integrates attractor dynamics with symbolic representations to model cognitive processes akin to the probabilistic language of thought (PLoT)
Our model segments the continuous representational space into discrete basins, with attractor states corresponding to symbolic sequences, that reflect the semanticity and compositionality characteristic of symbolic systems through unsupervised learning, rather than relying on pre-defined primitives.
This approach establishes a unified framework that integrates both symbolic and sub-symbolic processing through neural dynamics, a neuroplausible substrate with proven expressivity in AI, offering a more comprehensive model that mirrors the complex duality of cognitive operations
arXiv Detail & Related papers (2023-10-03T05:40:56Z) - Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z) - DIME: Fine-grained Interpretations of Multimodal Models via Disentangled
Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z) - A Meta-Bayesian Model of Intentional Visual Search [0.0]
We propose a computational model of visual search that incorporates Bayesian interpretations of the neural mechanisms that underlie categorical perception and saccade planning.
To enable meaningful comparisons between simulated and human behaviours, we employ a gaze-contingent paradigm that required participants to classify occluded MNIST digits through a window that followed their gaze.
Our model is able to recapitulate human behavioural metrics such as classification accuracy while retaining a high degree of interpretability, which we demonstrate by recovering subject-specific parameters from observed human behaviour.
arXiv Detail & Related papers (2020-06-05T16:10:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.