From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval
- URL: http://arxiv.org/abs/2502.12448v1
- Date: Tue, 18 Feb 2025 02:29:51 GMT
- Title: From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval
- Authors: Jian Jia, Jingtong Gao, Ben Xue, Junhao Wang, Qingpeng Cai, Quan Chen, Xiangyu Zhao, Peng Jiang, Kun Gai,
- Abstract summary: discrete tokenizers have emerged as indispensable components in modern machine learning systems.
This paper provides a systematic review of the design principles, applications, and challenges of discrete tokenizers.
- Score: 34.353945607932204
- License:
- Abstract: Discrete tokenizers have emerged as indispensable components in modern machine learning systems, particularly within the context of autoregressive modeling and large language models (LLMs). These tokenizers serve as the critical interface that transforms raw, unstructured data from diverse modalities into discrete tokens, enabling LLMs to operate effectively across a wide range of tasks. Despite their central role in generation, comprehension, and recommendation systems, a comprehensive survey dedicated to discrete tokenizers remains conspicuously absent in the literature. This paper addresses this gap by providing a systematic review of the design principles, applications, and challenges of discrete tokenizers. We begin by dissecting the sub-modules of tokenizers and systematically demonstrate their internal mechanisms to provide a comprehensive understanding of their functionality and design. Building on this foundation, we synthesize state-of-the-art methods, categorizing them into multimodal generation and comprehension tasks, and semantic tokens for personalized recommendations. Furthermore, we critically analyze the limitations of existing tokenizers and outline promising directions for future research. By presenting a unified framework for understanding discrete tokenizers, this survey aims to guide researchers and practitioners in addressing open challenges and advancing the field, ultimately contributing to the development of more robust and versatile AI systems.
Related papers
- Mechanistic understanding and validation of large AI models with SemanticLens [13.712668314238082]
Unlike human-engineered systems such as aeroplanes, the inner workings of AI models remain largely opaque.
This paper introduces SemanticLens, a universal explanation method for neural networks that maps hidden knowledge encoded by components.
arXiv Detail & Related papers (2025-01-09T17:47:34Z) - Interpreting token compositionality in LLMs: A robustness analysis [10.777646083061395]
Constituent-Aware Pooling (CAP) is a methodology designed to analyse how large language models process linguistic structures.
CAP intervenes in model activations through constituent-based pooling at various model levels.
Our findings highlight fundamental limitations in current transformer architectures regarding compositional semantics processing and model interpretability.
arXiv Detail & Related papers (2024-10-16T18:10:50Z) - Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks [50.75902473813379]
This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models.
The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes.
arXiv Detail & Related papers (2024-07-04T14:36:49Z) - Coding for Intelligence from the Perspective of Category [66.14012258680992]
Coding targets compressing and reconstructing data, and intelligence.
Recent trends demonstrate the potential homogeneity of these two fields.
We propose a novel problem of Coding for Intelligence from the category theory view.
arXiv Detail & Related papers (2024-07-01T07:05:44Z) - Interpretable and Explainable Machine Learning Methods for Predictive
Process Monitoring: A Systematic Literature Review [1.3812010983144802]
This paper presents a systematic review on the explainability and interpretability of machine learning (ML) models within the context of predictive process mining.
We provide a comprehensive overview of the current methodologies and their applications across various application domains.
Our findings aim to equip researchers and practitioners with a deeper understanding of how to develop and implement more trustworthy, transparent, and effective intelligent systems for process analytics.
arXiv Detail & Related papers (2023-12-29T12:43:43Z) - Sparsity-Guided Holistic Explanation for LLMs with Interpretable
Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z) - Towards a Responsible AI Metrics Catalogue: A Collection of Metrics for
AI Accountability [28.67753149592534]
This study bridges the accountability gap by introducing our effort towards a comprehensive metrics catalogue.
Our catalogue delineates process metrics that underpin procedural integrity, resource metrics that provide necessary tools and frameworks, and product metrics that reflect the outputs of AI systems.
arXiv Detail & Related papers (2023-11-22T04:43:16Z) - Recommender Systems in the Era of Large Language Models (LLMs) [62.0129013439038]
Large Language Models (LLMs) have revolutionized the fields of Natural Language Processing (NLP) and Artificial Intelligence (AI)
We conduct a comprehensive review of LLM-empowered recommender systems from various aspects including Pre-training, Fine-tuning, and Prompting.
arXiv Detail & Related papers (2023-07-05T06:03:40Z) - FACT: Learning Governing Abstractions Behind Integer Sequences [7.895232155155041]
We introduce a novel view on the learning of concepts admitting complete finitary descriptions.
We lay down a set of benchmarking tasks aimed at conceptual understanding by machine learning models.
To further aid research in knowledge representation and reasoning, we present FACT, the Finitary Abstraction Toolkit.
arXiv Detail & Related papers (2022-09-20T08:20:03Z) - What and How of Machine Learning Transparency: Building Bespoke
Explainability Tools with Interoperable Algorithmic Components [77.87794937143511]
This paper introduces a collection of hands-on training materials for explaining data-driven predictive models.
These resources cover the three core building blocks of this technique: interpretable representation composition, data sampling and explanation generation.
arXiv Detail & Related papers (2022-09-08T13:33:25Z) - Self-organizing Democratized Learning: Towards Large-scale Distributed
Learning Systems [71.14339738190202]
democratized learning (Dem-AI) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems.
Inspired by Dem-AI philosophy, a novel distributed learning approach is proposed in this paper.
The proposed algorithms demonstrate better results in the generalization performance of learning models in agents compared to the conventional FL algorithms.
arXiv Detail & Related papers (2020-07-07T08:34:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.