Fugu-MT Paper Translation (Abstract): See the Text: From Tokenization to Visual Reading

Paper overview: See the Text: From Tokenization to Visual Reading

arxiv url: http://arxiv.org/abs/2510.18840v1
Date: Tue, 21 Oct 2025 17:34:48 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:14.03635
Title: See the Text: From Tokenization to Visual Reading
Title (reference translation): See the Text: From Tokenization to Visual Reading
Authors: Ling Xing, Alex Jinpeng Wang, Rui Yan, Hongyu Qu, Zechao Li, Jinhui Tang
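Abstract summary: SeeTok renders text as images (visual-text) and uses pretrained multimodal LLMs to interpret them. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%. SeeTok signals a shift from symbolic tokenization to human-like visual reading and takes a step toward more natural and cognitively inspired language models.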
Reference score (independently computed attention measure): 63.10220471118435
License: http://creativecommons.org/licenses/by/4.0/
Abstract: People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
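Abstract (reference translation): People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, which lets them handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, which splits text into fragments drawn from a fixed vocabulary. Although this works well for high-resource languages, it over-segments low-resource languages, producing long, linguistically meaningless sequences and inflating computation. This work challenges that entrenched paradigm and moves toward a vision-centric alternative. The proposed method, SeeTok, renders text as images (visual-text) and interprets them with pretrained multimodal LLMs, reusing the strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while using 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading and takes a step toward more natural and cognitively inspired language models.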
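
The two efficiency figures in the abstract use different conventions: the token saving is stated as a ratio (4.43 times fewer), while the FLOP saving is stated as a percentage reduction (70.5%). The short Python sketch below only converts between those conventions so the two numbers can be compared directly. The function names and the 1,000-token example input are illustrative assumptions, not anything taken from the paper, and the sketch says nothing about how SeeTok itself measures tokens or FLOPs.

```python
# Arithmetic on the figures reported in the abstract (4.43x fewer tokens,
# 70.5% fewer FLOPs). Function names and the example input are hypothetical.

TOKEN_RATIO = 4.43          # reported: 4.43 times fewer tokens
FLOP_REDUCTION_PCT = 70.5   # reported: FLOPs reduced by 70.5%


def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the baseline cost that remains after a percentage reduction."""
    return 1.0 - reduction_pct / 100.0


def tokens_after_compression(baseline_tokens: float, ratio: float) -> float:
    """Sequence length after compressing a baseline length by the given ratio."""
    return baseline_tokens / ratio


def reduction_pct_from_ratio(ratio: float) -> float:
    """Express an 'N times fewer' ratio as a percentage reduction."""
    return (1.0 - 1.0 / ratio) * 100.0


if __name__ == "__main__":
    baseline = 1000  # hypothetical subword-token count, for illustration only
    print(f"visual tokens for {baseline} subword tokens: "
          f"{tokens_after_compression(baseline, TOKEN_RATIO):.1f}")
    print(f"token reduction as a percentage: "
          f"{reduction_pct_from_ratio(TOKEN_RATIO):.1f}%")
    print(f"FLOPs remaining: {remaining_fraction(FLOP_REDUCTION_PCT):.3f} of baseline")
    print(f"FLOP reduction as a ratio: "
          f"{1.0 / remaining_fraction(FLOP_REDUCTION_PCT):.2f}x fewer")
```

Read this way, 4.43 times fewer tokens corresponds to roughly a 77% reduction in sequence length, and a 70.5% FLOP reduction corresponds to roughly 3.4 times fewer FLOPs.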

Related papers